I am very new to DVC and am in the process of setting up all of our existing datasets with it, so that we can share the data efficiently and in a meaningful way.
DVC has been set up successfully using network drives as data storage. I have created a repo that contains several smaller datasets. Each dataset is placed in its own subfolder within a directory, which has been pushed to the storage location on the network drive using dvc push as one repo.
My question is: is it possible to pull only one dataset (one subfolder) from this repo? Later we will be applying different transformation and data augmentation techniques to these datasets to create new datasets. If we can’t pull subdirectories within the same repo, does this mean we need to create new repos instead?
You can use dvc pull on subfolders (or individual files) of a DVC-tracked directory. So if you have something like:
dvc add dataset/
You can do:
dvc pull dataset/some/subdir/
to only pull the specific data you are interested in.
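As a rough sketch of how this could be wrapped in a script: the helper below pulls a single path from inside a tracked directory and then checks that the path really exists in the workspace. The name restore_dataset is made up for illustration and is not part of DVC. It assumes dvc is on the PATH and that a default remote is already configured.

```shell
# Hypothetical helper (not part of DVC): pull one path inside a
# DVC-tracked directory and confirm it actually landed in the workspace.
restore_dataset() {
    target="${1:-}"
    if [ -z "$target" ]; then
        echo "usage: restore_dataset <path-inside-tracked-dir>" >&2
        return 2
    fi

    # Pull only the requested subfolder or file, not the whole directory.
    dvc pull "$target" || return 1

    # Guard against a pull that exits cleanly while the path is still absent.
    if [ ! -e "$target" ]; then
        echo "dvc pull finished but '$target' is still missing" >&2
        return 1
    fi
}
```

Calling restore_dataset dataset/some/subdir/ would then either leave that subfolder in place or return a non-zero status with a message on stderr.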
Hi, I followed your instructions. For example:
dvc add data/
Then I removed my local copy with:
rm -r data/original/dataset1
Then I tried to get it back from my remote copy with:
dvc pull data/original/dataset1
The response I got was: Everything is up to date.
But dataset1 was still missing from the data/original directory.
What am I doing wrong?
Can you please run
and then post the output here? It sounds like you might be using an outdated version of DVC.
Here it is:
Platform: Python 3.8.12 on Linux-4.15.0-166-generic-x86_64-with-glibc2.17
webhdfs (fsspec = 2022.1.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6)
Cache types: hardlink, symlink
Cache directory: xfs on /dev/sda3
Workspace directory: xfs on /dev/sda3
Repo: dvc, git
OK, I think I know why. I had to install the dvc[gs] package after running pip install dvc.
I think the instructions on this page could be a bit clearer: https://dvc.org/doc/install/linux
Or the error messages could be made a bit clearer. Just a suggestion.
Anyway, thanks for your help.
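For anyone who hits the same symptom, here is a minimal sketch of one way to check whether a remote type shows up in the diagnostic output. The helper name supports_remote is made up for illustration and is not part of DVC. It assumes the report lists each supported remote as "<type> (" followed by package versions, as in the output posted above.

```shell
# Hypothetical helper (not part of DVC): read diagnostic output on stdin
# and succeed if the given remote type appears as "<type> (" in it,
# e.g. "http (aiohttp = 3.8.1, ...)".
supports_remote() {
    grep -Eq "(^|[[:space:]])$1 \("
}
```

Usage might look like this, assuming dvc doctor prints a report in that format:

pip install "dvc[gs]"
dvc doctor | supports_remote gs && echo "gs remote support is installed"

The quotes around dvc[gs] keep the shell from treating the square brackets as a glob pattern.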