Using DVC for non-machine learning models

Hi all,

I work with simulation models requiring large input datasets and producing similarly large outputs. I’m trying to see if DVC would be an appropriate tool for version control of these datasets. My use case is somewhat different from that of a machine learning developer, I think.

An example would be a global climate model that can be run for different future scenarios of greenhouse gas emissions, or for different regions of the world. Each scenario/region is associated with a specific (set of) input file(s). The names of the files or directories containing the data for each scenario may be different.

These input datasets are not (necessarily) tied to specific versions of the code. In principle one could make a Git branch for each scenario, but that would be a hassle, and would potentially result in a great many branches. So what I’m looking for is the ability to check out a specific dataset, without checking out all the datasets for all scenarios, and without having to switch to a different Git branch or commit.

Obviously, adding new datasets representing new scenarios would require a Git commit; that’s OK.

I’ve been trying DVC and looking at the documentation, but so far I haven’t figured out whether this is possible and, if so, how to do it.

Thanks in advance,

Maarten

Hello Maarten, welcome to the community!

As to your question:
This use case seems to match the data registry use case.

Some of our users like to keep multiple datasets in one place, so they combine DVC and Git in a single repository whose only purpose is to version their datasets. In your case, you would probably create a repository containing all the datasets, each in its own directory (for example, data-cats-dogs and data-other). You keep all the datasets on the master branch and add new datasets as new commits. That’s pretty much it when it comes to storage.
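To make that a bit more concrete, here is a rough sketch of adding one dataset directory to such a registry. It assumes the registry repository has already been set up with git init and dvc init and has a DVC remote configured; the add_dataset and run helpers are just illustrative names, not part of DVC. By default it only prints the commands, so you can check them before running anything.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of executing them.
# Set DRY_RUN=0 to really run them (needs git, dvc, and a configured DVC remote).
DRY_RUN="${DRY_RUN:-1}"

# Helper (hypothetical, not part of DVC): echo or execute a command.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Track one dataset directory with DVC and commit its small .dvc file to Git.
add_dataset() {
    dataset_dir="$1"
    run dvc add "$dataset_dir"                  # writes <dir>.dvc, updates .gitignore
    run git add "$dataset_dir.dvc" .gitignore   # only the pointer file goes into Git
    run git commit -m "Add dataset $dataset_dir"
    run dvc push                                # upload the actual data to the remote
}

add_dataset data-cats-dogs
```

Each new scenario would then be one more add_dataset call and one more commit on master.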

As to how to use it:
When you need a specific dataset, you can use dvc get or dvc import to bring that particular dataset into your current project.

dvc import is intended to work with DVC repositories: it creates a .dvc file, so the project that uses the data keeps a record of exactly which data it imported. dvc get is intended to just download the data and leave it as is, so you can think of it as wget for DVC-controlled data.

Usage for import and get:

  • dvc import {data_registry_url} {path_to_data_from_registry_root}
  • dvc get {data_registry_url} {path_to_data_from_registry_root}

Optionally, you can specify the target revision with the --rev option.
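As an illustration, the two commands might look like this for a single scenario. The registry URL, the dataset path scenarios/rcp45 and the tag v1.0 are made-up placeholders, and dvc_fetch_cmd is only a small helper that prints the command line so you can inspect it before running it yourself.

```shell
#!/bin/sh
# Hypothetical registry URL and dataset path; substitute your own.
REGISTRY_URL="https://github.com/example/climate-data-registry"
DATASET_PATH="scenarios/rcp45"

# Print the dvc command that would fetch this one dataset.
#   $1 = "get"    (plain download, no tracking), or
#        "import" (download plus a .dvc file recording the source)
#   $2 = optional Git revision of the registry (branch, tag, or commit)
dvc_fetch_cmd() {
    mode="$1"
    rev="${2:-}"
    if [ -n "$rev" ]; then
        echo "dvc $mode $REGISTRY_URL $DATASET_PATH --rev $rev"
    else
        echo "dvc $mode $REGISTRY_URL $DATASET_PATH"
    fi
}

dvc_fetch_cmd get           # latest version on the registry's default branch
dvc_fetch_cmd import v1.0   # pinned to a (hypothetical) tag in the registry
```

Either way, only the one scenario directory is downloaded, not the whole registry.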

So, essentially, you end up with a single place to store your data, and you create new projects when you actually want to use it.

Do you think this could help you with your use case?
Best regards,
Paweł

PS, there is more info on data registries here: https://dvc.org/doc/use-cases/data-registries