I’m fairly new to dvc, but I have worked on a few projects that relied on it, in both scientific and business contexts. I like its simplicity and transparency, and I’m looking forward to using it in the future.
The thing I’ve found most cumbersome, yet necessary every time, is handling different environments: testing, experimentation, production, public data, and so on. It has always ended up either as branches following a naming pattern or as separate dataset directories on the main branch, and both approaches turn into a mess in their own way. I think it would be appropriate to address this directly in dvc. However, I don’t have a clear implementation plan, so I’m mainly looking for ideas here.
Let me just drop two common use cases; please share ideas and best practices relating to these:

- running quick integration tests on a small sample of data to validate changes before testing them properly
- maintaining an anonymized public sample of a dataset used in a scientific study, for public access, and running all pipelines through it
I’m happy to contribute to dvc directly, but I don’t want to start hacking away without asking for ideas and experiences first.
Looks like you’ve already described the two most straightforward ways to organize the project: branch-based vs. directory-based.
But you may not even need either one! Have you tried substituting the data (preserving the file name) in each environment and then using dvc repro?
If it’s a test, you may not need to commit the results, right? Or, to commit the pipeline state but keep the (anonymized) data itself out of storage, you can use cache: false in dvc.yaml outs fields (see ref).
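A minimal sketch of what that could look like, assuming a made-up stage that produces the public sample (the stage name, script, and paths are placeholders):

```yaml
stages:
  anonymize:
    cmd: python anonymize.py data/raw.csv data/public_sample.csv
    deps:
      - data/raw.csv
    outs:
      # Tracked by dvc, but not saved to the cache or pushed to remote storage
      - data/public_sample.csv:
          cache: false
```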
Another option (a workaround) is to parameterize your dvc.yaml file(s) so you can easily plug different datasets, output names, etc. in and out. See dvc.yaml Templating for more info. Example:
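A sketch of the templating approach, assuming the input path is the only templated value (stage name and script are made up):

```yaml
# params.yaml
input_fpath: data/in.csv
```

```yaml
# dvc.yaml — ${input_fpath} is filled in from params.yaml
stages:
  process:
    cmd: python process.py ${input_fpath}
    deps:
      - ${input_fpath}
    outs:
      - data/out.csv
```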
The trick is then to generate params.yaml in each environment, replacing data/in.csv with different values, e.g. taken from environment variables. For example, setting INPUT_FPATH and using a script like this:
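A one-line sketch of such a script, assuming input_fpath is the only value params.yaml needs to carry:

```shell
# Regenerate params.yaml from the INPUT_FPATH environment variable,
# falling back to the default path when the variable is unset
echo "input_fpath: ${INPUT_FPATH:-data/in.csv}" > params.yaml
```

You would run this (e.g. in CI or a wrapper script) before dvc repro, so each environment reproduces the pipeline against its own dataset.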
Ah, that reminds me: you can parameterize as I mentioned above, and then use dvc exp run --set-param instead (see ref), specifying each param value as needed (you could use env vars there too). E.g.:
```shell
$ export IN_FPATH=/mnt/nas/data/sample.csv
$ dvc exp run --set-param myparams.input_fpath=$IN_FPATH
```