I’m fairly new to dvc, but I have worked on a few projects that relied on it, in both scientific and business contexts. I like its simplicity and transparency, and I’m looking forward to using it in the future.
The thing I’ve found most cumbersome, yet necessary every time, is handling different environments: testing, experimentation, production, public data, and so on. It has always ended up either as branches following a naming pattern or as separate dataset directories on the main branch, and both approaches turn into a mess in their own way. I think it would be appropriate to address this directly in dvc. However, I don’t have a clear implementation plan, so I’m mainly looking for ideas here.
Let me just drop two common use cases; please share ideas and best practices relating to these:

- running quick integration tests on a small sample of data to validate changes before testing them properly
- maintaining an anonymized public sample of a dataset used in a scientific study, for public access, and running all pipelines through it
I’m happy to contribute to dvc directly, but I don’t want to start hacking away without asking for ideas and experiences first.
Looks like you’ve already described the two most straightforward ways to organize the project: branch-based vs. directory-based.
But you may not even need either one! Have you tried substituting the data (preserving the file name) in each environment and then using dvc repro?
If it’s a test, you may not need to commit the results, right? Or, to commit the pipeline state but keep the (anonymized) data itself out of storage, you can use cache: false in dvc.yaml outs fields (see ref).
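A minimal sketch of what that could look like, assuming a made-up stage that produces the public sample (the stage name, script, and paths are placeholders):

```yaml
stages:
  anonymize:
    cmd: python anonymize.py data/raw.csv data/public_sample.csv
    deps:
      - data/raw.csv
    outs:
      # Tracked by dvc, but not saved to the cache or pushed to remote storage
      - data/public_sample.csv:
          cache: false
```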
Another option (a workaround) is to parameterize your dvc.yaml file(s) so you can easily plug different datasets, output names, etc. in and out. See dvc.yaml Templating for more info. Example:
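A sketch of the templating approach, assuming the input path is the only templated value (stage name and script are made up):

```yaml
# params.yaml
input_fpath: data/in.csv
```

```yaml
# dvc.yaml — ${input_fpath} is filled in from params.yaml
stages:
  process:
    cmd: python process.py ${input_fpath}
    deps:
      - ${input_fpath}
    outs:
      - data/out.csv
```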
The trick is then to generate params.yaml in each environment, replacing data/in.csv with different values, e.g. taken from environment variables. For example, setting INPUT_FPATH and using a script like this:
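A one-line sketch of such a script, assuming input_fpath is the only value params.yaml needs to carry:

```shell
# Regenerate params.yaml from the INPUT_FPATH environment variable,
# falling back to the default path when the variable is unset
echo "input_fpath: ${INPUT_FPATH:-data/in.csv}" > params.yaml
```

You would run this (e.g. in CI or a wrapper script) before dvc repro, so each environment reproduces the pipeline against its own dataset.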
Ah, that reminds me: you can parameterize as I mentioned above, and then use dvc exp run --set-param instead (see ref), specifying each param value as needed (you could use env vars there too). E.g.:
```shell
$ export IN_FPATH=/mnt/nas/data/sample.csv
$ dvc exp run --set-param myparams.input_fpath=$IN_FPATH
```