Objectives
I am trying to set up a workflow where we have data (a directory of binary files) stored and versioned on a NAS, and different clients have different options depending on what they want (and where the experiment is run):
- Run the pipeline with data fully downloaded from the NAS.
- Run the pipeline with data mounted (or otherwise retrieved) from the NAS, without fully downloading it.
- Inspect/add/remove files in the dataset without downloading the rest of the data.
Current Problem
I am new to DVC, so I am starting with small steps and already facing issues. I set up a local remote that points to a folder under the NAS mount point on my machine:
```
dvc remote add -d origin /mnt/data/segmentation/dvc
```
`/mnt/data` is the mount point of the NAS.
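As a sanity check, listing the remotes should show something like this (output is a sketch from memory, not copied verbatim):

```
$ dvc remote list
origin  /mnt/data/segmentation/dvc
```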
Since I already had the initial version of the data stored on the NAS, I added it with the following command:

```
dvc add /mnt/data/segmentation/dataset --to-remote
```

It copied the data into the remote as expected.
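For reference, the resulting `.dvc` file looks roughly like this (hashes and sizes elided; the exact field layout may differ between DVC versions):

```
outs:
- md5: <directory-hash>.dir
  size: <total-bytes>
  nfiles: 4659
  hash: md5
  path: dataset
```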
Then I tried to download a single file from the dataset:

```
dvc pull dataset/path/to/single/file.png
```
But it actually started to download the whole dataset!
```
Collecting |0.00 [00:00, ?entry/s]
Fetching
  3%|▎         |Fetching from local          126/4659 [00:08<05:33, 13.59file/s]
```
What commands should I use instead to download a single file, and then to upload the edited file (creating a new version of the dataset), without transferring everything else?
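For clarity, this is the round trip I am hoping for (just a sketch; I am not sure these are the right commands, and the path only mirrors the example above):

```
# fetch only the one file from the tracked directory
dvc pull dataset/path/to/single/file.png

# ...edit the file locally...

# re-hash the directory and upload only the changed file
dvc add dataset
dvc push dataset/path/to/single/file.png
```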
Environment
My `dvc.config`:
```
[core]
    remote = origin
[cache]
    type = "hardlink,symlink"
['remote "origin"']
    url = /mnt/data/segmentation/dvc/
```
`dvc doctor` output:
```
DVC version: 3.36.1 (pip)
-------------------------
Platform: Python 3.11.6 on Linux-6.6.6-arch1-1-x86_64-with-glibc2.38
Subprojects:
        dvc_data = 3.3.0
        dvc_objects = 3.0.0
        dvc_render = 1.0.0
        dvc_task = 0.3.0
        scmrepo = 2.0.2
Supports:
        gs (gcsfs = 2023.12.2.post1),
        http (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.1, aiohttp-retry = 2.8.3)
Config:
        Global: /home/andrew/.config/dvc
        System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p2
Caches: local
Remotes: local
Workspace directory: ext4 on /dev/nvme0n1p2
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/e982b2365151c5553fc93e6ac9ceb1bc
```
Future Problem
I would also like to ask about the second point: would such an option be possible at all? DVC appears to store data in a quite specific format (I manually inspected the contents of `/mnt/data/segmentation/dvc/`), so just mounting the NAS would not be enough. I also found an API function, but it would require writing specific adapters to load files dynamically during training/inference, which is not desirable, though acceptable if it is really the only option. I also have doubts about the performance of this approach.
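For reference, this is roughly the kind of adapter I have in mind, assuming the function in question is `dvc.api.open` (the path and arguments below are only illustrative):

```python
import dvc.api

# Sketch: stream a single tracked file on demand, without pulling the
# whole dataset. Assumes dvc.api.open is the relevant API; the path
# mirrors the example above and the repo argument is illustrative.
with dvc.api.open(
    "dataset/path/to/single/file.png",  # file tracked inside the dataset
    repo=".",                           # local repo; a Git URL also works
    mode="rb",                          # binary read for image data
) as f:
    image_bytes = f.read()
```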