Symlink semantics in tracked directories

What are the semantics of symlinks inside a directory tree that I track all at once with dvc add <some directory>?

I have some COCO image datasets of this form:

rootdir/
├── clientdata
│   ├── images
│   ├── dataset-val.json
│   └── dataset-train.json
├── clientdata-merged
│   ├── images
│   ├── dataset-merged-val.json
│   └── dataset-merged-train.json
└── clientdata-supplemental
    ├── dataset-supplemental-val.json
    ├── dataset-supplemental-train.json
    └── images

What I would like to have is this situation:

  • all the images in clientdata/images are dvc-tracked image files
  • all the images in clientdata-supplemental/images are dvc-tracked image files
  • all the images in clientdata-merged/images are dvc-tracked symlinks to the other two image subdirectories

That way, no data is duplicated.

I think that what is happening now is that running dvc add rootdir doesn’t preserve the symlinks, and when I dvc pull on another host, I get three subdirectories of regular image files. I’m confident that this is what happens on Windows; I am pretty sure it’s true on Linux as well.

Am I thinking about things the right way here? Are there other strategies that I should be considering?

DVC does not track symlinks as links. It resolves each link target and then tracks the target's content as a regular file (which you have already noticed).
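
If you want to see which entries would be affected before running dvc add on a directory, something like the following may help. This is only a rough sketch using the Python standard library, not anything DVC itself provides; the function name find_symlinks is made up for illustration.

```python
import os
from pathlib import Path


def find_symlinks(root):
    """Return sorted (link_path, resolved_target) pairs for symlinks under root.

    Hypothetical helper for illustration only; it is not part of DVC.
    """
    found = []
    # followlinks=False: report symlinked directories without descending into them.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            if path.is_symlink():
                # resolve() follows the link, mirroring the "resolve the target" idea.
                found.append((path, path.resolve()))
    return sorted(found)
```

Calling find_symlinks("rootdir") on the layout above should list every link under clientdata-merged/images together with the real file it points to, which are the entries that end up tracked as regular files.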

One option here would be to use symlinks as your DVC cache link type (dvc config cache.type symlink), so that DVC stores the original image files only in the .dvc/cache directory, and all the files in your workspace are symlinks to the cached versions (so only one copy of each file is stored on disk). See Large Dataset Optimization for more details.

Alternatively, instead of tracking rootdir with DVC, you could track clientdata and clientdata-supplemental separately with DVC, and track clientdata-merged with Git (since Git handles symlinks the way you are expecting it to). Just note that on Windows, this requires enabling core.symlinks in Git, since not all Windows filesystems support symlinks (see: Git - git-config Documentation).
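
For the second option, the links in clientdata-merged/images have to be created somehow before Git can track them. Here is a minimal sketch of one way to do that, assuming the directory names from your question and assuming image file names are unique across the two source directories. The function link_merged_images is a hypothetical helper, not a DVC or Git feature.

```python
import os
from pathlib import Path


def link_merged_images(root,
                       sources=("clientdata", "clientdata-supplemental"),
                       merged="clientdata-merged"):
    """Fill <root>/<merged>/images with relative symlinks to the source images.

    Hypothetical helper; directory names are taken from the question above.
    """
    root = Path(root)
    merged_dir = root / merged / "images"
    merged_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for source in sources:
        src_dir = root / source / "images"
        for image in sorted(src_dir.iterdir()):
            if not image.is_file():
                continue
            link = merged_dir / image.name
            # Assumption: names are unique across sources, so a clash is an error.
            if link.is_symlink() or link.exists():
                raise FileExistsError(f"{link} already exists")
            # Relative targets keep the links valid when the repo is cloned elsewhere.
            target = os.path.relpath(image, start=merged_dir)
            os.symlink(target, link)
            created.append(link)
    return created
```

After running link_merged_images("rootdir"), the links can be committed with Git as usual. On another host the links only become usable once dvc pull has restored the images they point to.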

OK, I think this is the way forward.

Thanks for your prompt help! DVC has been immediately useful out of the box for me in ways that things like git-lfs never were.

I can handle the cross-platform oddities easily enough now that I know what’s going on.