All the images in clientdata/images are DVC-tracked image files.
All the images in clientdata-supplemental/images are DVC-tracked image files.
All the images in clientdata-merged/images are DVC-tracked symlinks to the other two image subdirectories.
This way no data is duplicated.
I think what is happening now is that running dvc add rootdir doesn't respect the symlinks, so when I run dvc pull on another host, I get three subdirectories of real image files instead of two directories of files plus one directory of links. I'm confident that this is what happens on Windows; I'm fairly sure it's true on Linux as well.
Am I thinking about this the right way? Are there other strategies that I should be considering?
DVC does not track symlinks. It resolves each link target and tracks the result as a regular file (which you have already noticed).
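To make that concrete, here is a minimal Python sketch of why a content-addressed tool ends up treating a link like its target. It is not DVC's code, only an illustration using the standard library: the file names and the `md5_of` and `demo` helpers are made up for this example, and it assumes a filesystem where the current user can create symlinks.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def md5_of(path: Path) -> str:
    """Hash file contents. Reading follows symlinks, so a link hashes like its target."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def demo() -> dict:
    """Build a tiny copy of the layout from the question and compare hashes."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "clientdata" / "images").mkdir(parents=True)
        (root / "clientdata-merged" / "images").mkdir(parents=True)

        # Hypothetical image file standing in for a real one.
        original = root / "clientdata" / "images" / "a.png"
        original.write_bytes(b"fake image bytes")

        # Relative symlink in the merged directory pointing at the original.
        link = root / "clientdata-merged" / "images" / "a.png"
        os.symlink(os.path.relpath(original, link.parent), link)

        return {
            "link_is_symlink": link.is_symlink(),
            "same_hash": md5_of(original) == md5_of(link),
        }


if __name__ == "__main__":
    print(demo())
```

Because the link and the original produce the same content hash, nothing in the hash records that one of them was a link, which matches what you see after dvc pull.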
One option here would be to use symlinks as your DVC cache link type (the cache.type config option). DVC would then store the original image files only in the .dvc/cache directory, and all the files in your repo would be symlinks to the cached versions, so only one copy of each file would be stored on disk. See Large Dataset Optimization for more details.
Alternatively, instead of tracking rootdir with DVC, you could track clientdata and clientdata-supplemental separately with DVC, and track clientdata-merged with Git, since Git handles symlinks the way you are expecting it to. Just note that on Windows this requires enabling core.symlinks in Git, because not all Windows filesystems support symlinks (see: Git - git-config Documentation).
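If you go the Git route, it helps to make the links relative, because Git stores the link target string as written and an absolute path would break on another host. Below is a rough sketch of a script that could rebuild clientdata-merged/images that way. The `build_merged` helper is hypothetical, the directory names are taken from your description, and it simply skips name clashes rather than resolving them.

```python
import os
from pathlib import Path


def build_merged(
    root: Path,
    sources=("clientdata", "clientdata-supplemental"),
    merged="clientdata-merged",
) -> list:
    """Fill <merged>/images with relative symlinks to every file in the source image dirs."""
    merged_images = root / merged / "images"
    merged_images.mkdir(parents=True, exist_ok=True)

    created = []
    for source in sources:
        for target in sorted((root / source / "images").iterdir()):
            link = merged_images / target.name
            if link.is_symlink() or link.exists():
                continue  # keep existing entries; name clashes are not resolved here
            # A relative target keeps the link valid when the repo is cloned elsewhere.
            os.symlink(os.path.relpath(target, merged_images), link)
            created.append(link)
    return created


if __name__ == "__main__":
    for new_link in build_merged(Path(".")):
        print(new_link, "->", os.readlink(new_link))
```

You would run it from rootdir, then add clientdata-merged to Git while the two source directories stay under DVC.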