Same remote for multiple repos

Hi

I was wondering how the remote mechanisms work.
Assuming I have two different DVC repos, that may use the same base dataset.
Can I use the same remote for both of them, so the dataset is not stored twice?

Strictly speaking, this isn't possible because of hash collisions, is it?

Regards
Matthias

Hi @ynop !

Thank you for a great question! Sure, you can do that. If a file has hash 123456 in one project, it will have the same hash in another project, so the shared remote stores it only once. Strictly speaking, hash collisions are always possible simply because of the nature of hashes, but, as in many other scenarios, they are very unlikely in our case: the number of data files is quite small compared with, for example, the number of objects in a git repository. So you are safe using the same remote to store data for different projects.
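To illustrate why a shared remote does not store the dataset twice, here is a minimal Python sketch. It is not DVC's actual code: `push` and `collision_probability` are hypothetical helpers, a local directory stands in for the remote, and the sketch assumes MD5 content hashes with a two-character directory prefix, which resembles how DVC lays out its cache by default.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the hex MD5 digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def push(path: Path, remote: Path) -> Path:
    """Copy a file into a content-addressed store keyed by its hash.

    Hypothetical stand-in for pushing to a remote; the file name and the
    project it comes from play no role in where it is stored.
    """
    checksum = file_md5(path)
    target = remote / checksum[:2] / checksum[2:]
    if not target.exists():  # identical content is already there: skip the copy
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(path.read_bytes())
    return target


def collision_probability(n_files: int, hash_bits: int = 128) -> float:
    """Birthday-bound approximation of at least one accidental collision."""
    return n_files * (n_files - 1) / 2 / 2 ** hash_bits


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    remote = root / "shared_remote"
    # Two separate projects that happen to contain the same base dataset.
    for project in ("project_a", "project_b"):
        (root / project).mkdir()
        (root / project / "dataset.csv").write_bytes(b"id,label\n1,cat\n2,dog\n")

    stored_a = push(root / "project_a" / "dataset.csv", remote)
    stored_b = push(root / "project_b" / "dataset.csv", remote)
    stored_files = [p for p in remote.rglob("*") if p.is_file()]

    print("same object in remote:", stored_a == stored_b)
    print("objects stored:", len(stored_files))
    print("collision estimate for 1e6 files:", collision_probability(1_000_000))
```

Both projects map the dataset to the same location in the store, so the second push is a no-op. The estimate covers accidental collisions only, not deliberately crafted ones.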

Thanks,
Ruslan

To take the question a bit further…

Let’s say I have 2 git/dvc repositories: One creates versions of a remote stored artifact (such as an ML model) & the second just clones this same remote artifact to apply it (such as to get ML predictions). Is it possible to make repo2 track the latest version made by repo1?

I can see how git committing repo2/artifact.dvc as a symbolic link to repo1/artifact.dvc would achieve this, but wondering if there’s a dvc native way that’s platform independent.

Do you mean syncing the two repos automatically? Only the data files are tracked by dvc; everything else, including the dvc configuration files, is tracked by Git. So maybe what you need is to push after every git commit. If so, a Git hook can help you achieve that.

Right, I should have mentioned that the desired tracking in repo2 does not need to happen on every commit of repo1; it could be more sporadic and asynchronous, say on major releases of either repo.

The question is, how can one make repo2's dependency on repo1's remote artifact automatically aware of any new artifact versions whenever repo2 is ready to pull the latest? Namely, how can I make the repo2/artifact.dvc pointer file match repo1/artifact.dvc?

I suspect creating the repo2/artifact.dvc via the dvc import command is the way to do this?

Sorry for the late reply. dvc import is of course an option, especially when repo1 and repo2 don't have the same contents. But you still need to run dvc update manually to bring the imported data in sync with the latest version.
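As a rough sketch of that workflow, the snippet below only builds and prints the two command lines rather than running them. The repository URL and the artifact path are placeholders, and `import_command` and `update_command` are hypothetical helper names, not part of DVC.

```python
import shlex


def import_command(repo_url, path, rev=None):
    """Build the `dvc import` command that creates the pointer file in repo2."""
    argv = ["dvc", "import", repo_url, path]
    if rev is not None:
        argv += ["--rev", rev]  # optionally pin a branch, tag, or commit of repo1
    return argv


def update_command(dvc_file, rev=None):
    """Build the `dvc update` command that refreshes an existing import."""
    argv = ["dvc", "update", dvc_file]
    if rev is not None:
        argv += ["--rev", rev]
    return argv


# Placeholder values for illustration only.
REPO1_URL = "https://example.com/team/repo1.git"
ARTIFACT = "artifact"

# Run once in repo2, then again whenever repo2 is ready for a newer version.
for argv in (import_command(REPO1_URL, ARTIFACT), update_command(ARTIFACT + ".dvc")):
    print(shlex.join(argv))
```

To actually execute a command, pass the list to `subprocess.run(argv, check=True)` from inside the repo2 working tree, then commit the changed artifact.dvc with Git.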

Indeed: dvc import initially & later dvc update were just the commands I was looking for. Thanks!
