We have developers working with data shared on a local network, and I’d like to understand whether/how dvc could integrate with this pipeline. I think I’m asking whether it’s possible (or even makes sense) to have a single, shared cache – kinda like the dvc cloud workflow you describe but without push/pull. The code just reads/writes data outside the git repo.
Another reason I’d like to leave data outside the repo is that many projects have the same (large) dataset as a dependency.
Is the answer hardlinks? Worried 'cause there’s already a lot of linking going on … (Duh. Not across filesystems!)
To be clear, this looks like an awesome tool that I’d like to adapt to if possible.
Great questions, thank you! It totally makes sense, and we are working on supporting external dependencies/outputs right now (https://github.com/dataversioncontrol/dvc/issues/706, https://github.com/dataversioncontrol/dvc/issues/705), including s3/gcp/sftp/hdfs support as well as local files outside of the repo. Support for those is scheduled for 0.9.8 (we have 0.9.7 scheduled for this week, so 0.9.8 should be ready around early June).
You are totally correct about hardlinks. However, in your particular scenario the outputs lie outside of your repo, so if the desired cache dir can be placed on the same filesystem as your outputs, dvc could use the most efficient linking method available (i.e. reflink/hardlink/symlink/copy) in the future. Just note that this is not implemented yet, so we still need to figure out the particular details.
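To make the idea concrete, here is a rough Python sketch of a "try the cheapest link type first, fall back to a copy" strategy. It is not dvc's implementation (which does not exist yet, as noted above); the function names `same_filesystem` and `link_from_cache` are hypothetical, and the reflink step is left out because the Python standard library has no portable call for it.

```python
import os
import shutil


def same_filesystem(path_a, path_b):
    """Return True if both existing paths are on the same device.

    Hardlinks only work when this is True.
    """
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def link_from_cache(cache_path, out_path):
    """Make cache_path available at out_path (hypothetical helper).

    Tries a hardlink, then a symlink, then a plain copy, and returns
    the name of the method that succeeded. out_path should not exist.
    """
    # Use an absolute target so a symlink does not dangle when
    # cache_path was given relative to the current directory.
    cache_path = os.path.abspath(cache_path)
    for name, method in (("hardlink", os.link), ("symlink", os.symlink)):
        try:
            method(cache_path, out_path)
            return name
        except OSError:
            # For example: cross-device link, or links not permitted here.
            continue
    shutil.copyfile(cache_path, out_path)
    return "copy"
```

Calling `link_from_cache("/shared/cache/abc", "/shared/data/train.csv")` (both paths are made up) would hardlink when the cache and the output sit on the same filesystem and would fall back to a symlink or copy otherwise.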
Cool! Thanks for the info and links. Looking forward to the next couple releases.
For the record:
Shared cache setup: https://dvc.org/doc/use-cases/shared-development-server
- See also the other related questions here, especially those about setting up NAS storage.