Why does DVC pull the full dataset instead of a single file?

Objectives

I am trying to set up a workflow where data (a directory of binary files) is stored and versioned on a NAS, and different clients have different options, depending on what they want (and where the experiment is run):

  1. Run pipeline with data downloaded from NAS.
  2. Run pipeline with data mounted (or otherwise retrieved) from NAS, without fully downloading it.
  3. Inspect/add/remove files in the dataset, without downloading the rest of the data.

Current Problem

I am new to DVC, so I am starting with small steps and still facing issues. I set up a local remote that points to a folder under the NAS mount point on my machine, with the command:

dvc remote add -d origin /mnt/data/segmentation/dvc

/mnt/data is the mount point of the NAS.

Since the initial version of the data was already stored on the NAS, I added it with the following command:

dvc add /mnt/data/segmentation/dataset --to-remote

It copied the data to the remote.

Then, I tried to download a single file from the dataset. I ran the following command:

dvc pull dataset/path/to/single/file.png

But it actually started to download the whole dataset!

Collecting                                     |0.00 [00:00,    ?entry/s]
Fetching
  3%|▎         |Fetching from local            126/4659 [00:08<05:33, 13.59file/s]

What commands should I use instead to download a single file, edit it, and then upload it (updating the version of the dataset) without downloading everything else?

Environment

My .dvc/config:

[core]
    remote = origin
[cache]
    type = "hardlink,symlink"
['remote "origin"']
    url = /mnt/data/segmentation/dvc/

dvc doctor output:

DVC version: 3.36.1 (pip)
-------------------------
Platform: Python 3.11.6 on Linux-6.6.6-arch1-1-x86_64-with-glibc2.38
Subprojects:
        dvc_data = 3.3.0
        dvc_objects = 3.0.0
        dvc_render = 1.0.0
        dvc_task = 0.3.0
        scmrepo = 2.0.2
Supports:
        gs (gcsfs = 2023.12.2.post1),
        http (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.1, aiohttp-retry = 2.8.3)
Config:
        Global: /home/andrew/.config/dvc
        System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p2
Caches: local
Remotes: local
Workspace directory: ext4 on /dev/nvme0n1p2
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/e982b2365151c5553fc93e6ac9ceb1bc

Future problem

I would also like to ask about the second point. Would such an option be possible at all? DVC appears to store data in a rather specific format (I manually inspected the contents of /mnt/data/segmentation/dvc/), so just mounting the NAS would not be enough. I also found an API function, but it would require writing custom adapters to load files dynamically during training/inference, which is undesirable, though acceptable if it is really the only option. I also have doubts about the performance of this approach.

Hi, @async_andrew. This looks like a regression in dvc>=3.0.0. I have opened an issue: partial fetch is broken · Issue #10199 · iterative/dvc · GitHub. Please subscribe to it.

Thanks.

Thank you for the information and for reproducing the bug for me!

Apart from partial downloading, could you please advise me on my second usage scenario? What is the best way to achieve it with DVC?

Hi. Regarding the 2nd point, yes, that’s possible. DVC has a concept of external dependency. See External Dependencies and Outputs.
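
For example, a minimal dvc.yaml sketch (the train.py script and stage name are hypothetical; the NAS path is the one from your config), where DVC tracks the mounted path as an external dependency and re-runs the stage when it changes:

stages:
  train:
    cmd: python train.py /mnt/data/segmentation/dataset
    deps:
      - /mnt/data/segmentation/dataset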

Inspect/add/remove files in the dataset, without downloading the rest of the data.

DVC does support all of these operations without downloading. See Modifying Large Datasets.

You can append a file or a directory to an existing dataset. You can also remove a file or a directory. But there’s a regression, which will hopefully be fixed soon.
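
For example, a sketch of a granular append, assuming the new image file already sits inside the workspace copy of the dataset directory (the file name is hypothetical; the full workflow is in the guide above):

dvc add dataset/new-image.png   # appends just this file to the tracked directory
dvc push                        # uploads only the new object(s) to the remote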

You can inspect it with the dvc ls command.
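
For example, assuming the repository root is the current directory and the tracked directory is dataset:

dvc ls . dataset   # lists the directory contents without downloading anything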


Thank you for the answer!

The section mentions how to link external dependencies, but what if I want to address a versioned external dependency (i.e., a DVC remote) without downloading it? It seems that this is only possible with cloud versioning, which is obviously unavailable for NAS drives. Am I correct?

If you are using the DVC Python API, it's possible to stream a specific version of a file tracked by DVC using dvc.api.open(..., rev=...) and dvc.api.read(..., rev=...) (where rev is the Git revision for the version you want to use).
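
A minimal sketch, assuming the file path from earlier in the thread and a hypothetical Git tag v1.0:

import dvc.api

# Stream a single file at a given Git revision; nothing else is downloaded.
with dvc.api.open(
    "dataset/path/to/single/file.png",
    repo=".",    # or a Git URL of the DVC repository
    rev="v1.0",  # hypothetical tag; any commit, branch, or tag works
    mode="rb",   # binary mode, since this is an image
) as f:
    data = f.read()

# dvc.api.read is the one-shot equivalent:
data = dvc.api.read("dataset/path/to/single/file.png", rev="v1.0", mode="rb")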

Otherwise, you can also use dvc get --show-url --rev=... from the command line to show the remote URL for the specific version of the file you want (and then pass that path into whatever other tool/script you need). For a local remote, the URL in this case would just be a path within your local mount point for the NAS drive.
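
For example, with the same hypothetical v1.0 tag (. being the repository to read from):

dvc get . dataset/path/to/single/file.png --show-url --rev v1.0

This prints something like /mnt/data/segmentation/dvc/files/md5/ab/cdef... rather than downloading the file. The Python equivalent is dvc.api.get_url(path, repo=..., rev=...).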


Hi @async_andrew. Would you be able to test fix partial fetch by skshetry · Pull Request #10205 · iterative/dvc · GitHub and see whether partial pull works for you?

You can use the following command to install it:

pip install git+https://github.com/iterative/dvc@refs/pull/10205/merge

Hi @skshetry. I have just checked, and partial pull works fine. Thank you!


Hi @pmrowla. What should I specify as a dependency (in dvc.yaml) for a stage that uses the DVC API to get data from a DVC remote, so that the data is tracked properly?

DVC doesn’t really support that use case. If you have a dependency on a DVC-tracked file from another DVC repository, you would normally use dvc import to track that dependency instead.
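
For example, a sketch with a hypothetical source repository; dvc import downloads the file and writes a .dvc file recording the source repo and revision, which DVC can then track and update:

dvc import https://github.com/example/data-registry dataset/path/to/single/file.png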