[I posted this question in the #dvc Discord channel, and @shcheklein suggested that other users can benefit from the discussion, so I’m copying it here]
Some time ago I asked in the forum about using DVC to manage our data on a NAS drive (Single cache or multiple caches in NAS with External Data). I got useful advice from @pmrowla (thanks!), who suggested setting up a Data Registry. I did, but users are getting permission errors when they try to add new datasets.
The problem is that we have some constraints that go a bit against how DVC seems to be designed. Namely, we have a folder datasets on the NAS with large datasets, both in size and number of files. Multiple users need to be able to save directories/files to this folder, and then add/commit/push them with DVC. We cannot have data duplication either, and the directories/files in the datasets folder need to remain there, for non-DVC users to read them.
So, our current solution is that the NAS is mounted on a server at /nas/, and looks like this:
nas
├── dvc_cache
└── datasets
    ├── .dvc
    ├── .git
    ├── dataset_1
    └── dataset_2
with the following configuration file (hardlinks/symlinks to avoid data duplication):
[cache]
dir = /nas/dvc_cache/
shared = group
type = "hardlink,symlink"
[core]
autostage = true
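For reference, the same settings can also be written with the DVC command line instead of editing the file by hand. This is only a sketch, assuming it is run from inside /nas/datasets so that it updates that repository's .dvc/config:

```shell
# Run from inside the DVC repository (assumed here to be /nas/datasets).
cd /nas/datasets

# Point the cache at the shared directory on the NAS.
dvc cache dir /nas/dvc_cache/

# Ask DVC to make new cache files group-writable.
dvc config cache.shared group

# Prefer hardlinks, fall back to symlinks, so data is not duplicated.
dvc config cache.type hardlink,symlink

# Automatically `git add` the files that DVC creates or changes.
dvc config core.autostage true
```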
But an immediate problem is that, since both users work in the same Git working tree, if user A does dvc add dataset_1 and user B does dvc add dataset_2, then user A can run git commit -a and git push, and thereby commit/push both datasets.
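One workaround I could imagine is for each user to commit by path instead of using git commit -a. The sketch below tries this in a throwaway repository using plain Git only. The two .dvc files are hand-written stand-ins for what dvc add would create and autostage, and the user name and email are made up. It does not deal with .gitignore, which dvc add also modifies and which both users would still share.

```shell
set -e

# Throwaway repository standing in for /nas/datasets.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "user_a@example.com"   # hypothetical identity
git config user.name "user_A"
git commit -q --allow-empty -m "init"

# Stand-ins for the .dvc files that `dvc add` would create and autostage.
echo "outs: dataset_1" > dataset_1.dvc   # staged by user A
echo "outs: dataset_2" > dataset_2.dvc   # staged by user B
git add dataset_1.dvc dataset_2.dvc

# Committing by pathspec records only dataset_1.dvc;
# dataset_2.dvc stays staged in the shared index.
git commit -q -m "Add dataset_1" -- dataset_1.dvc

committed=$(git diff-tree --no-commit-id --name-only -r HEAD)
still_staged=$(git diff --cached --name-only)
echo "committed: $committed"
echo "still staged: $still_staged"
```

This relies on every user remembering to pass the path, so it is a convention rather than real isolation between users.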
Another issue is that when user_A adds a dataset, the directory created in the cache, e.g. /nas/dvc_cache/e0, ends up with both owner and group user_A, instead of owner user_A and group everybody. This blocks anybody else from running DVC operations that involve /nas/dvc_cache/e0.
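To make the ownership problem easier to check, here is a small sketch of how the owner, group and mode of a cache directory can be printed. The helper name show_perms is made up, GNU coreutils stat is assumed, and the example runs on a temporary directory rather than the real cache:

```shell
# Hypothetical helper: print "owner group mode" for a path
# (assumes GNU coreutils `stat`).
show_perms() {
    stat -c '%U %G %A' "$1"
}

# Throwaway stand-in for /nas/dvc_cache so the check can be tried safely.
cache_dir=$(mktemp -d)
mkdir "$cache_dir/e0"
chmod 775 "$cache_dir/e0"

perms=$(show_perms "$cache_dir/e0")
echo "$perms"
```

On the real cache this would be show_perms /nas/dvc_cache/e0, where the second field is the one that shows user_A instead of everybody.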
My feeling is that each user should have their own clone of the Data Registry on /nas, but I’m not sure this works with the Data Registry idea. Suggestions would be very welcome, thanks!