Working with a small subset of remote data

Hi,

I’m trying to use DVC for managing data produced from simulations. Each simulation produces a large number of small files, and a few very large files. I have worked out how to set up a git repository with DVC and add the data (one .dvc file per simulation), and also how to push it to remote storage. There are too many files to have one .dvc file per file. I think this is known as the “data registry” use case.
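For reference, this is roughly how I set up the registry repository (a minimal sketch; the simulations/sim-001 path, the remote name and the S3 URL are just placeholders):

# in the data registry repository
dvc init
dvc remote add -d storage s3://my-bucket/sim-data    # placeholder remote
dvc add simulations/sim-001                          # one .dvc file per simulation directory
git add simulations/sim-001.dvc simulations/.gitignore
git commit -m "Add simulation sim-001"
dvc push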

I then want to start a new project and work on, say, one of the simulations.
I think what I want to do is to “dvc import” the simulation data directory into the new project. This works, and I can import just a single simulation. However, I would like to be able to get just a small subset of the files within a simulation, without downloading the whole simulation.
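For example, importing a whole simulation works with something like this (the repository URL and path are placeholders):

dvc import https://github.com/me/sim-registry simulations/sim-001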

I think what I want to be able to do is to “dvc import” but tell it not to actually fetch the data, and then be able to tell it which files under the simulation I want to fetch.

Is such a thing possible?

Hi @ianhinder

Very interesting use case! How do you see this command working? Would something like a --no-fetch option work for you:

dvc import --no-fetch https://github.com/iterative/dataset-registry use-cases/cats-dogs/
dvc pull \
    cats-dogs/data/train/dogs/dog.100.jpg \
    cats-dogs/data/train/dogs/dog.101.jpg \
    cats-dogs/data/train/dogs/dog.102.jpg

Please let me know what other options you have in mind.

Yes, that’s the sort of thing I had in mind. It would also be good to be able to specify what I want to download beyond just a list of filenames, for example with wildcards, or maybe more advanced brace expansion, such as

dvc pull cats-dogs/data/train/{dogs,cats}/**/*.img

though I appreciate that if this is not yet somewhere in the code, it’s quite a big project to design and add. Possibly something like rsync’s filter rules? Again, quite big to add. But it would be good to be able to specify the subsets without having to know the exact filenames to download, so at least simple wildcards, if there is a way to do that.
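As a side note, plain brace expansion happens in the shell before dvc sees the arguments, so, assuming per-file pull targets work as in your example, something like the following should be equivalent to listing the files explicitly. The ** glob is a different matter, since the shell cannot expand it against files that have not been downloaded yet:

# the shell expands this to three explicit file arguments before dvc runs
dvc pull cats-dogs/data/train/dogs/dog.10{0,1,2}.jpg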

With your idea, would we be able to have the simulation committed as a directory (i.e. one .dvc file), and then pull different parts under that? I had the impression that if you try to do this at the moment, it pulls the whole directory.

--no-fetch should be easy to implement, and a partial (per-file) pull can be used after it. However, a default pull (dvc pull or dvc pull cats-dogs) will still download all the files; we need to implement a proper pattern mechanism to solve that. Below are the action points.

Implement --no-fetch

--no-fetch might be useful for this scenario (followed by a partial, per-file pull) or for updating repository metadata without downloading the data. I created an issue: https://github.com/iterative/dvc/issues/4815

Workaround

For now, as a workaround, you can create the .dvc file manually. The only tricky part is the checksums, which you will need to find and copy from the original repo: rev_lock and outs.md5:

md5: ''
frozen: true
deps:
- path: use-cases/cats-dogs
  repo:
    url: https://github.com/iterative/dataset-registry
    rev_lock: f31f5c4cdae787b4bdeb97a717687d44667d9e62 # use master HEAD checksum from the original repo
outs:
- md5: 22e3f61e52c0ba45334d973244efc155.dir # find it in the dvc file in the original repo
  path: cats-dogs
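A rough sketch of how to find these two values (assuming the registry tracks the directory with use-cases/cats-dogs.dvc, as in the dataset-registry example):

# rev_lock: the Git commit to pin to, e.g. the current master HEAD of the registry
git ls-remote https://github.com/iterative/dataset-registry refs/heads/master

# outs.md5: the .dir checksum recorded in the registry's own .dvc file
git clone --depth 1 https://github.com/iterative/dataset-registry /tmp/dataset-registry
cat /tmp/dataset-registry/use-cases/cats-dogs.dvc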

Implement import/pull by pattern

A proper solution will require defining a pattern in dvc import. dvc pull should also support this pattern. Wildcards are one kind of pattern.

I’d also include more advanced patterns, like date-based ones: dvc pull users/%Y/%m/%d/users.csv?startdate=2020-09-01,enddate=now,ignoremissing

Created another ticket: https://github.com/iterative/dvc/issues/4816
