Hi everyone! First question: how do I point multiple projects to a single dataset?

I stumbled upon dvc over the weekend after discussing with my colleagues our need for just such a thing. I got it working in some test repos with data on our private s3 buckets. Very cool! I really appreciate the documentation, it’s very well put-together.

One thing I’m not sure of is:

If I have a dataset on a remote server and several different repos/projects that use it, what is the appropriate way to point them all to that dataset and/or a single dvc remote representation thereof? (In each repo/project, do I need to download it, dvc add it, and assign the dataset to the same remote?)

thanks!
Rory

Hi @stauntonjr !

One way to handle it would be to put the dataset into a data registry and then just dvc import it in the other projects: https://dvc.org/doc/use-cases/data-registries. Would that work for you?
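
A rough sketch of what the registry approach could look like on the command line. The registry URL and dataset path are hypothetical placeholders, and the registry repo is assumed to already have its own dvc remote configured; treat this as an illustration rather than the one required setup.

```shell
# Hypothetical names: replace the registry URL and dataset path with your own.
REGISTRY_URL="https://example.com/team/dataset-registry.git"
DATASET_PATH="data/images"

# Run once, inside the registry repo: track the dataset with DVC, commit the
# small .dvc file to Git, and upload the data to the registry's own remote
# (assumed to be configured already).
publish_dataset() {
    dvc add "$DATASET_PATH" &&
    git add "$DATASET_PATH.dvc" "$(dirname "$DATASET_PATH")/.gitignore" &&
    git commit -m "Track $DATASET_PATH with DVC" &&
    git push &&
    dvc push
}

# Run inside each project that needs the dataset: download it from the
# registry and create a .dvc file that records where it came from.
import_dataset() {
    dvc import "$REGISTRY_URL" "$DATASET_PATH"
}
```

Calling `publish_dataset` in the registry repo and `import_dataset` in each consuming project means every project references the same registry instead of running dvc add on its own copy.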

Hi, I have the same problem.
If I use a data registry as you suggested and then dvc import it into the other projects,
will the data be downloaded and duplicated in each project? Would I then have multiple copies of the same data in different projects on the same server?
Thanks!
Naama

Hi.

Data brought in with dvc import is not affected by dvc push, so it is stored only in the data registry's remote, with no duplication.

Thanks,
What do you mean by "is not affected by dvc push"?
I have a data registry and another code project.
After I run dvc import in the code project, the data is downloaded into the project folder along with its .dvc file. Doesn't this data take up space on my server? Does it use links?

After you run dvc import, the data is downloaded locally and its .dvc file is created, but dvc push won't push that data to this project's remote. The next time you run dvc pull, it will download the data from the data registry it was imported from, so it won't take up space in this project's remote. 🙂
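
To make the flow above a bit more concrete, here is a small sketch for the consuming project. The file name `images.dvc` is a hypothetical example, and the `is_import` helper rests on the assumption that the .dvc file written by dvc import lists the source repository in a `repo:` entry under `deps`; check one of your own .dvc files before relying on it.

```shell
# Hypothetical file name: the .dvc file created by `dvc import` in the project.
IMPORT_FILE="images.dvc"

# Report whether a .dvc file records a source repository. This is one way to
# tell an imported dataset apart from one added with `dvc add`.
# Assumption: an import's .dvc file contains a `repo:` entry under `deps`.
is_import() {
    grep -q '^ *repo:' "$1"
}

# In the consuming project: only the small .dvc file goes into Git;
# the data itself stays out of the repository.
commit_import() {
    git add "$IMPORT_FILE" .gitignore &&
    git commit -m "Import dataset from the registry"
}

# On a fresh clone of the project: download the data again. As described
# above, it comes from the registry the dataset was imported from.
restore_import() {
    dvc pull "$IMPORT_FILE"
}
```

For example, `is_import images.dvc && echo "comes from a registry"` prints the message only when the file points back at a source repo.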