Environment support

Hello!

I’m fairly new to dvc, but I have worked on a few projects that relied on it, in both scientific and business contexts. I like its simplicity and transparency, and I’m looking forward to using it in the future.

The thing I’ve found most cumbersome, yet necessary every time, is handling different environments: testing, experimentation, production, public data, and so on. It ended up either as branches following a naming pattern or as separate dataset directories on the main branch, and both led to their own kind of mess. I think it would be appropriate to address this directly in dvc, but I don’t have a clear implementation plan, so I’m mainly looking for ideas here.

Let me just drop two common use cases; please share ideas and best practices relating to these:

  • running quick integration tests on a small data sample to validate changes before testing them properly
  • maintaining an anonymized, publicly accessible sample of a dataset used in a scientific study, and running all pipelines on it

I’m happy to contribute to dvc directly, but I don’t want to start hacking away without first asking for ideas and experiences.


Hi @endremborza

Looks like you’ve already described the 2 most straightforward ways to organize the project: branch-based vs. directory-based.

But you may not even need either one! Have you tried substituting the data (preserving the file name) in each environment and then using dvc repro?
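
For example (the paths here are just placeholders), each environment could drop its own copy of the input file into the path the pipeline expects before reproducing:

$ cp /mnt/test-samples/in_small.csv data/in.csv   # swap in this environment's data, keeping the file name
$ dvc repro                                       # DVC sees the changed dependency and reruns the affected stages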

If it’s a test, you may not need to commit the results at all, right? Or, to commit the pipeline state but keep the data itself out of DVC storage (e.g. for the anonymized case), you can set cache: false on the relevant outs fields in dvc.yaml (see ref), as sketched below.
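
A minimal sketch of the latter (the stage and file names are made up): the output is still recorded in dvc.lock, but the file itself is never saved to the DVC cache or pushed to remote storage.

# dvc.yaml
stages:
  anonymize:
    cmd: ./anonymize.sh data/in.csv public_sample.csv
    deps:
    - data/in.csv
    outs:
    - public_sample.csv:
        cache: false   # hash tracked in dvc.lock, file kept out of the DVC cache/remote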

Another option (workaround) is to parameterize your dvc.yaml file(s) so you can easily plug in/out different datasets, output names, etc. See dvc.yaml Templating for more info. Example:

# params.yaml
myparams:
  input_fpath: data/in.csv

# dvc.yaml
stages:
  myprocess:
    cmd: ./process.sh ${myparams.input_fpath} labels.txt
    deps:
    - process.sh
    - ${myparams.input_fpath}
    outs:
    - labels.txt

The trick would then be to generate params.yaml in each environment, replacing data/in.csv with a different value, e.g. taken from an environment variable. For example, setting INPUT_FPATH and using a script like this:

echo "myparams:" > params.yaml
echo "  input_fpath: $INPUT_FPATH" >> params.yaml

p.s. there is a long-standing feature request about environment variables in dvc.yaml, in case you’d like to support it on GH: pipelines: parametrize using environment variables / DVC properties · Issue #1416 · iterative/dvc · GitHub

Oh, I did not know about the ${myparams.input_fpath} syntax, that’s nice.

Still, all these ideas seem a little hacky.

Ideally, I would prefer running dvc repro with some --env=test parameter that somehow respects a directory-based organization.

Ah, that reminds me that you can parameterize like I mentioned, and then use dvc exp run --set-param instead (see ref), specifying each param value as needed (could use env vars there too). E.g.:

$ export IN_FPATH=/mnt/nas/data/sample.csv
$ dvc exp run --set-param myparams.input_fpath=$IN_FPATH

See also Experiment Management.

That’s getting a lot nicer 🙂

So a test case can be treated as an experiment; that is a great approach to one of the cases.

The public-data case still fits a little less well, though.


This sounds like a good scenario for the data registry pattern.
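
A sketch of what that could look like (the registry repo URL and paths are hypothetical): keep the anonymized public sample in a separate “data registry” Git+DVC repository, and pull it into the study project with dvc import:

$ dvc import https://github.com/example/dataset-registry \
             public/sample-anonymized.csv -o data/in.csv

This records the source repo and revision in a .dvc file, so the public instance stays reproducible and can later be refreshed with dvc update.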