Understanding data registries and remotes

lusk · January 19, 2024, 9:13am

Hi all,

I am creating a data registry for my team so they can easily pull data into their own projects. These projects should never modify the registry's data directly; the registry should simply serve as a project-agnostic store. The data is stored inside the registry project itself, which is also DVC/Git-managed, since data is regularly updated, added, and transformed (again, by actions/pipeline stages managed within the registry project, not by external projects).

So, when it comes to managing the DVC remotes for both the registry and for the separate projects that may use the data…

1. Does it make sense to add a remote to the registry project itself? In this case it would be a local remote (e.g. dvc remote add this-project /path/to/registry/data). I’m still wrapping my head around local remotes in general, and in particular “really local” remotes (i.e. inside the same DVC project).
2. When others want to dvc get or dvc import data from the data registry, do they need to dvc remote add the data registry? It seems that having it as a remote is not required for either command. Am I correct in assuming that DVC remotes are more for project-specific data versioning/backup, as opposed to the project-agnostic data registry itself?
3. If that’s true, and if I have a local DVC remote configured for the registry project, should I be pushing/pulling as I update data, or simply relying on dvc add or the lock file produced by pipeline stages?
4. If I am adding a remote that is also a DVC project, should the remote address (local or external) point to the actual data directory, or simply to the DVC project? E.g. dvc remote add my-project /path/to/project/data vs. dvc remote add my-project /path/to/project.

Hopefully the questions are clear, but happy to clarify further if not. I’m a bit new to DVC so any help is appreciated!

dberenbaum · January 19, 2024, 8:13pm

Hi @lusk, is everyone working on a single shared filesystem? If so, you should not technically need a dvc remote unless you want to use it for backups of your data. It sounds like all of this is possible without ever adding a dvc remote to any of the projects, although that means they will not be transferable if you ever want to access them outside of this filesystem.
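
As a rough sketch of what that could look like on a shared filesystem: the helpers below wrap dvc get, dvc import, and an optional backup remote. The registry path, the remote name "backup", and the function names are all hypothetical placeholders, not anything from your setup.

```shell
# Hypothetical location of the registry on the shared filesystem.
REGISTRY=/path/to/registry

# Copy a tracked file out of the registry without tracking it locally.
fetch_copy() {
    dvc get "$REGISTRY" "$1" -o "$2"
}

# Import a tracked file and record its source in a .dvc file,
# so that `dvc update` can later pick up new versions.
import_tracked() {
    dvc import "$REGISTRY" "$1" -o "$2"
}

# Optional: run inside the registry to add a default remote for backups.
# The URL is a plain storage directory, not a DVC project.
add_backup_remote() {
    dvc remote add -d backup "$1"
    dvc push
}
```

From inside a consuming project, something like `import_tracked data/raw.csv raw.csv` would then pull a (hypothetical) data/raw.csv from the registry. No dvc remote add is involved in the consuming project for either helper.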

You might also want to look at sharing the DVC cache (see the how-to guide “How to Share a Cache Among Projects”) so that everyone can share a single copy of the data instead of duplicating it for each user/project.
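
A minimal sketch of the shared-cache setup, assuming a hypothetical cache directory on the common filesystem and a hypothetical helper name; adjust the path and link type to your environment.

```shell
# Hypothetical shared cache location on the common filesystem.
SHARED_CACHE=/path/to/shared/dvc-cache

# Run inside each project that should use the shared cache.
use_shared_cache() {
    dvc cache dir "$SHARED_CACHE"      # point the project at the shared cache
    dvc config cache.shared group      # create cache files group-writable
    dvc config cache.type symlink      # link into the workspace instead of copying
}
```

These commands write to the project's .dvc/config, so the change can be committed to Git and picked up by everyone who clones the project.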

lusk · January 20, 2024, 5:24pm

Hm, interesting. So far all of the external stakeholders should technically be working in-system, but some may wish to work from their local devices so having that capability would be ideal.

In terms of cache sharing, my main worry is permissions. Given the somewhat steep learning curve of a tool like DVC (at least it has been in my case), I’d really prefer the registry to be read-only for external users. Is this possible with a shared cache?

dberenbaum · January 22, 2024, 2:43pm

Since the data registry is a Git repo, changes to it should be limited to whoever has write access to that repo, regardless of whether the cache is read-only.
