I’m fairly new to DVC, but I have worked on a few projects that relied on it, in both scientific and business contexts. I like its simplicity and transparency, and I’m looking forward to using it in the future.
The thing I’ve found most cumbersome, and yet necessary every time, is handling different environments: testing, experimentation, production, public data, whatever. Each time, the environments ended up either on separate branches named after some pattern or in separate dataset directories on the main branch, and both approaches got messy in their own way. I think it would be appropriate to address this directly in DVC. However, I don’t have a clear implementation plan, so I’m mainly looking for ideas here.
Let me just drop two common use cases. Please share ideas and best practices relating to these:
- running quick integration tests on a small sample of data to validate changes before testing them properly
- maintaining an anonymized public sample of a dataset used in a scientific study for public access and running all pipelines through it
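
To make the separate-directories approach concrete, here is a minimal sketch of roughly what I mean, in plain Python. The `data/<env>` layout, the `DATA_ENV` variable and both function names are made up for illustration; none of this is part of DVC.

```python
import csv
import os
import random
from pathlib import Path

# Hypothetical layout: one directory per environment under data/.
DATA_ROOT = Path("data")
ENVIRONMENTS = ("full", "test", "public")


def dataset_dir(env=None):
    """Resolve the dataset directory for an environment.

    Falls back to the (made-up) DATA_ENV variable, then to "full".
    """
    env = env or os.environ.get("DATA_ENV", "full")
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown data environment: {env!r}")
    return DATA_ROOT / env


def write_sample(src, dst, fraction, seed=0):
    """Copy a reproducible random subset of CSV rows from src to dst.

    The header row is always kept. Returns the number of data rows written.
    """
    rng = random.Random(seed)  # fixed seed so the sample is stable across runs
    dst = Path(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)
    kept = 0
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        header = next(reader, None)
        if header is None:  # empty input: nothing to sample
            return 0
        writer.writerow(header)
        for row in reader:
            if rng.random() < fraction:
                writer.writerow(row)
                kept += 1
    return kept
```

A quick test run would then read from `dataset_dir("test")` instead of the full data. This only covers sampling; the anonymization step for the public dataset would need its own logic, and keeping all of these copies in sync by hand is exactly the part that gets messy.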
I’m happy to contribute to DVC directly, but I don’t really want to start hacking away without first asking for ideas and experiences.