Hi all,
I am creating a data registry for my team so they can easily pull data into their projects. The data in the registry should never be directly modified by these projects; it should simply serve as an agnostic store. The registry's data is stored inside the registry project itself, which is also a DVC/Git-managed project, since data is regularly updated, added, and transformed (again, by actions/pipeline stages managed within the data registry project, not by external projects).
So, when it comes to managing the DVC remotes for both the registry and for the separate projects that may use the data…
- Does it make sense to add a remote to the registry project itself? In this case it would be a local remote (e.g. `dvc remote add this-project /path/to/registry/data`). I’m still wrapping my head around local remotes in general, and in particular “really local” remotes (i.e. ones inside the same DVC project).
- When others want to either `dvc get` or `dvc import` data from the data registry, do they need to `dvc remote add` the data registry? It seems like having it as a remote is not required for others to `dvc get` or `dvc import`. Am I correct in assuming that DVC remotes are more for project-specific data versioning/backup, as opposed to the project-agnostic data registry itself?
- If that’s true, and if I have a local DVC remote configured for the registry project, should I be pushing/pulling as I update data, or simply using `dvc add` or the lock file produced by pipeline stages?
- If I am adding a remote that is also a DVC project, should the remote address (local or external) point to the actual data directory, or simply to the DVC project? E.g. `dvc remote add my-project /path/to/project/data` vs `dvc remote add my-project /path/to/project`.
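
To make the scenario concrete, here is a rough sketch of the commands I'm asking about. Everything in it is a placeholder: the `REGISTRY` and `DATASET` values and the two function names are hypothetical, and I'm not claiming this is the recommended setup (that's what I'm trying to find out). The commands are wrapped in functions so that nothing runs until one of them is called.

```shell
# Placeholders (hypothetical, not my real setup): adjust before use.
REGISTRY=/path/to/registry     # the DVC/Git-managed registry project
DATASET=some-dataset           # hypothetical path tracked inside the registry

# Registry side: the "really local" remote from the first question.
registry_side() {
  cd "$REGISTRY" || return 1
  dvc remote add this-project /path/to/registry/data
  dvc add "$DATASET"           # or let a pipeline stage record it in dvc.lock
  dvc push -r this-project     # the step I'm unsure is needed
}

# Consumer side: run from inside another project, with no
# `dvc remote add` pointing at the registry.
consumer_side() {
  dvc get "$REGISTRY" "$DATASET"      # plain download, not tracked afterwards
  dvc import "$REGISTRY" "$DATASET"   # download plus a .dvc file recording the source
}
```

Calling `registry_side` inside the registry and `consumer_side` inside a consuming project would mirror the workflow described in the questions above.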
Hopefully the questions are clear, but happy to clarify further if not. I’m a bit new to DVC so any help is appreciated!