Symlink semantics in tracked directories

gd-gv · May 2, 2023, 8:40pm

What are the semantics of symlinks inside a directory tree that I track all at once with dvc add <some directory>?

I have some COCO image datasets of this form:

rootdir/
├── clientdata
│   ├── images
│   ├── dataset-val.json
│   └── dataset-train.json
├── clientdata-merged
│   ├── images
│   ├── dataset-merged-val.json
│   └── dataset-merged-train.json
└── clientdata-supplemental
    ├── dataset-supplemental-val.json
    ├── dataset-supplemental-train.json
    └── images

What I would like to have is this situation:

all the images in clientdata/images are dvc-tracked image files
all the images in clientdata-supplemental/images are dvc-tracked image files
all the images in clientdata-merged/images are dvc-tracked symlinks to the other two image subdirectories

and this way no data is duplicated.

I think what is happening now is that running dvc add rootdir doesn’t respect the symlinks, so when I dvc pull on another host, I get three subdirectories of real image files. I’m confident that this is what happens on Windows; I am pretty sure it’s true on Linux as well.

Am I thinking about this the right way? Are there other strategies that I should be considering?

pmrowla · May 3, 2023, 5:54am

DVC does not track symlinks; it resolves each link target and then tracks the result as a regular file (which you have already noticed).

One option here would be to use symlinks as your DVC cache link type, so that DVC stores the original image files only in the .dvc/cache directory, and all the files in your repo are symlinks to the cached versions (so only one copy of each file is stored on disk). See Large Dataset Optimization in the DVC documentation for more details.
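To check whether the symlink link type has taken effect after a checkout, one rough approach is to count how many workspace files are symlinks rather than regular files. The sketch below is plain Python and not part of DVC; count_link_types is a hypothetical helper, and it assumes DVC links file by file, so each tracked image would show up as a symlink. The link type itself is set through DVC's cache.type config option (for example, dvc config cache.type symlink), described in the guide mentioned above.

```python
import os
from collections import Counter


def count_link_types(root):
    """Count symlinks vs. regular files under root without following links.

    Hypothetical helper for illustration; it only inspects the filesystem
    and does not call DVC.
    """
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # islink() looks at the link itself, not at what it points to.
            kind = "symlink" if os.path.islink(path) else "regular"
            counts[kind] += 1
    return counts


# Example call (the path is an assumption taken from the layout in the question):
# print(count_link_types("rootdir/clientdata/images"))
```

If most image files are reported as regular, the workspace probably still holds copies (or another link type), and the files may need to be re-linked after changing the setting.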

Alternatively, instead of tracking rootdir with DVC, you could track clientdata and clientdata-supplemental separately with DVC, and clientdata-merged with Git (since Git handles symlinks the way you are expecting it to). Just note that on Windows this requires enabling core.symlinks in Git, since not all Windows filesystems support symlinks; see the git-config documentation.
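For the Git-tracked variant, the merged images directory has to be populated with symlinks somehow. A minimal sketch of one way to do that is below, assuming the directory layout from the question; link_images is a hypothetical helper, and it assumes file names do not collide between the two source sets. It writes relative link targets so the links should still resolve when the repository is cloned to a different path.

```python
import os


def link_images(source_dirs, merged_dir):
    """Fill merged_dir with relative symlinks to every file in source_dirs.

    Hypothetical helper for illustration; returns the list of links created.
    """
    os.makedirs(merged_dir, exist_ok=True)
    created = []
    for source in source_dirs:
        for name in sorted(os.listdir(source)):
            target = os.path.join(source, name)
            if not os.path.isfile(target):
                continue  # this sketch skips subdirectories
            link = os.path.join(merged_dir, name)
            if os.path.lexists(link):
                raise FileExistsError(f"name collision in merged set: {name}")
            # Store the target relative to the link's own directory.
            os.symlink(os.path.relpath(target, start=merged_dir), link)
            created.append(link)
    return created


# Example call (paths are assumptions taken from the layout in the question):
# link_images(
#     ["rootdir/clientdata/images", "rootdir/clientdata-supplemental/images"],
#     "rootdir/clientdata-merged/images",
# )
```

The resulting links can then be committed with Git while the two source directories stay under DVC. On Windows, creating symlinks may additionally require Developer Mode or elevated privileges.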

gd-gv · May 3, 2023, 12:44pm

OK, I think this is the way forward.

Thanks for your prompt help! DVC has been immediately useful out of the box for me in ways that things like git-lfs never were.

I can handle the cross-platform oddities easily enough now that I know what’s going on.
