I work with simulation models that require large input datasets and produce similarly large outputs. I’m trying to see whether DVC would be an appropriate tool for version control of these datasets. I think my use case is somewhat different from that of a machine learning developer.
An example would be a global climate model that can be run for different future scenarios of greenhouse gas emissions, or for different regions of the world. Each scenario/region is associated with a specific input file or set of input files. The names of the files or directories containing the data may differ from one scenario to the next.
These input datasets are not (necessarily) tied to specific versions of the code. In principle one could make a Git branch for each scenario, but that would be a hassle and could result in a great many branches. So what I’m looking for is the ability to check out a specific dataset without checking out all the datasets for all scenarios, and without having to switch to a different Git branch or commit.
Obviously, adding new datasets representing new scenarios would require a Git commit, and that’s OK.
I’ve been trying DVC and looking at the documentation, but so far I haven’t figured out whether this is possible and, if so, how to do it.
Thanks in advance,