I’m trying to use DVC to manage data produced by simulations. Each simulation produces a large number of small files and a few very large files. There are too many files to track each one with its own .dvc file, so I have set up a Git repository with DVC that tracks each simulation directory as a whole (one .dvc file per simulation), and I have worked out how to push the data to remote storage. I think this is known as the “data registry” use case.
I then want to start a new project and work on, say, one of the simulations.
I think the way to do this is to “dvc import” the simulation data directory into the new project. This works, and I can import just a single simulation. However, I would like to get only a small subset of the files within a simulation, without downloading the whole simulation.
In other words, I would like to run “dvc import” but tell it not to actually fetch the data, and then tell it which files under the simulation I want to fetch.
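To make the workflow I have in mind concrete, here is a minimal sketch written as a small Python wrapper around the DVC command line. It assumes a DVC release in which "dvc import" accepts the --no-download flag, and it uses "dvc get" to fetch individual files by their path inside the tracked directory. The registry URL, the simulation directory name, the file names, and the "subset" output directory are all hypothetical placeholders, and I do not know whether this is the recommended approach.

```python
import subprocess
from pathlib import Path


def partial_fetch_commands(registry_url, sim_dir, wanted_files, out_dir="subset"):
    """Build DVC CLI calls for a 'link now, fetch selectively' workflow."""
    # Step 1: record the import (writes a .dvc file) without downloading data.
    # Assumes the installed DVC supports `dvc import --no-download`.
    commands = [["dvc", "import", "--no-download", registry_url, sim_dir]]
    # Step 2: fetch each wanted file by its path inside the simulation.
    # Files go to a separate directory so they do not collide with the
    # import target itself.
    for rel_path in wanted_files:
        source = f"{sim_dir}/{rel_path}"
        target = f"{out_dir}/{rel_path}"
        commands.append(["dvc", "get", registry_url, source, "-o", target])
    return commands


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if dry_run:
            continue
        if "-o" in cmd:
            # Make sure the parent directory of the output path exists.
            Path(cmd[-1]).parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical registry URL and file names, for illustration only.
    cmds = partial_fetch_commands(
        "https://example.com/sim-registry.git",
        "sim_001",
        ["params.json", "summary/energy.csv"],
    )
    run(cmds)  # dry run: only prints the commands
```

As far as I understand, files fetched with "dvc get" are plain downloads and are not tracked by the new project, so this gives me the files but not the link back to the registry that "dvc import" provides.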
Is such a thing possible?