I stumbled upon dvc over the weekend after discussing with my colleagues our need for just such a thing. I got it working in some test repos with data on our private s3 buckets. Very cool! I really appreciate the documentation; it’s very well put together.
One thing I’m not sure of is:
If I have a dataset on a remote server and several different repos/projects that use it, what is the appropriate way to point them all to that dataset and/or a single dvc remote representation of it? (In each repo/project, do I need to download it, dvc add it, and assign the dataset to the same remote?)
Hi @stauntonjr !
One way to handle it would be to put it into a data registry and then just dvc import it in other projects: https://dvc.org/doc/use-cases/data-registries. Would that work for you?
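As a rough sketch of what that could look like in one of the consuming projects: the registry URL and dataset path below are hypothetical placeholders, and the project is assumed to be already set up with git init and dvc init. The function name import_shared_dataset is just for illustration.

```shell
# Hypothetical values: replace with your own registry repo and dataset path.
REGISTRY_URL="https://github.com/example/dataset-registry"
DATASET_PATH="data/images"

import_shared_dataset() {
    # Name the dataset will have in the current directory (here: "images").
    name=$(basename "$DATASET_PATH")

    # Download the dataset from the registry and write "$name.dvc",
    # which records where the data came from.
    dvc import "$REGISTRY_URL" "$DATASET_PATH"

    # Track only the small .dvc file (and the .gitignore entry) in git.
    git add "$name.dvc" .gitignore
    git commit -m "Import $name from the data registry"
}
```

Calling import_shared_dataset from the root of each project that needs the dataset gives every project a reference to the same registry entry.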
Hi, I have the same problem.
If I use a data registry as you suggested and then dvc import it into the other projects, will the data be duplicated and downloaded to each project? Would I then have multiple copies of the same data in different projects on the same server?
dvc imported data is not affected by dvc push, so it will only be stored in the data registry remote, with no duplication.
What do you mean by "is not affected by dvc push"?
I have a data registry and another code project. When I dvc import in the code project, the data is downloaded to the project folder along with its .dvc file. Does this data not take up space on my server? Does it use links?
When you dvc import, the data is downloaded locally and its .dvc file is created, but dvc push won’t push it to the remote. So the next time you run dvc pull, it will be downloaded from the data registry it was imported from, and it won’t take up space in this project’s remote.
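To make that concrete, here is a minimal sketch of how someone might restore or refresh an imported dataset in a fresh clone of the project. It assumes the behaviour described above, and the file name images.dvc and both function names are hypothetical.

```shell
# Hypothetical .dvc file created earlier by dvc import.
IMPORT_FILE="images.dvc"

restore_imported_data() {
    # As described above, the data is fetched from the registry it was
    # imported from, not from this project's own remote.
    dvc pull "$IMPORT_FILE"
}

update_imported_data() {
    # Bring the import up to date with the latest version in the registry,
    # then commit the changed .dvc file.
    dvc update "$IMPORT_FILE"
    git add "$IMPORT_FILE"
    git commit -m "Update imported dataset"
}
```

restore_imported_data is what you would run after cloning; update_imported_data is only needed when the dataset in the registry has changed and you want the newer version.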