Can and how DVC work with shared network storage?

Ian · May 16, 2018, 3:28am

We have developers working with data shared on a local network, and I’d like to understand whether/how dvc could integrate with this pipeline. I think I’m asking whether it’s possible (or even makes sense) to have a single, shared cache – kinda like the dvc cloud workflow you describe but without push/pull. The code just reads/writes data outside the git repo.

Another reason I’d like to leave data outside the repo is because many projects have the same (large) dataset as a dependency.

~~Is the answer hardlinks? Worried 'cause there’s already a lot of linking going on …~~ (Duh. Not across filesystems!)

To be clear, this looks like an awesome tool that I’d like to adopt if possible.

kupruser · May 16, 2018, 9:08am

Hi Ian!

Great questions, thank you! It totally makes sense, and we are working (https://github.com/dataversioncontrol/dvc/issues/706, https://github.com/dataversioncontrol/dvc/issues/705) on supporting external dependencies/outputs right now (including s3/gcp/sftp/hdfs support, as well as local files outside of the repo). Support for those is scheduled for 0.9.8 (we have 0.9.7 scheduled for this week, so 0.9.8 should be ready around early June).

You are totally correct about hardlinks. However, in your particular scenario the outputs lie outside of your repo, so if the cache dir can be placed on the same filesystem as your outputs, dvc could use the most efficient link type available (i.e. reflink/hardlink/symlink/copy) in the future. Just note that this is not implemented yet, so we still need to figure out the particular details.
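
To illustrate the same-filesystem point, here is a minimal Python sketch. It is not dvc's actual implementation, only the general idea of trying the cheapest link type first and falling back. The function names `same_filesystem` and `link_from_cache` are hypothetical, and reflinks are skipped because the Python standard library has no portable call for them.

```python
import os
import shutil


def same_filesystem(path_a: str, path_b: str) -> bool:
    """Return True when both existing paths report the same device ID."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def link_from_cache(cache_file: str, target: str) -> str:
    """Place cache_file at target using the cheapest method that works.

    Hypothetical helper: tries a hardlink, then a symlink, then a full
    copy, and returns the name of the method that succeeded.
    """
    # Refuse to clobber an existing file or link at the target path.
    if os.path.lexists(target):
        raise FileExistsError(target)

    try:
        # Hardlinks only work when both paths are on the same filesystem.
        os.link(cache_file, target)
        return "hardlink"
    except OSError:
        pass

    try:
        # Symlinks can cross filesystems but may be unsupported or forbidden.
        os.symlink(os.path.abspath(cache_file), target)
        return "symlink"
    except OSError:
        pass

    # Last resort: duplicate the data.
    shutil.copy2(cache_file, target)
    return "copy"
```

Calling `same_filesystem(cache_dir, output_dir)` first tells you whether a hardlink is even worth attempting. When it returns False, the hardlink attempt fails and the sketch falls through to a symlink or a copy.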

Cheers,
Ruslan

Ian · May 16, 2018, 2:44pm

Cool! Thanks for the info and links. Looking forward to the next couple releases.

shcheklein · January 1, 2020, 9:34pm

For the record:

Shared cache setup: https://dvc.org/doc/use-cases/shared-development-server

Check the other related questions here, especially those about setting up NAS storage.

