Large Data Registry on NAS with multiple DVC and non-DVC users

rcasero · August 10, 2022, 8:46am

[I posted this question in the #dvc Discord channel, and @shcheklein suggested that other users can benefit from the discussion, so I’m copying it here]

Some time ago I asked in the forum about using DVC to manage our data on a NAS drive (Single cache or multiple caches in NAS with External Data). I got useful advice from @pmrowla , thanks, who suggested setting up a Data Registry. I did, but we are having trouble with users getting permission errors to add new datasets.

The problem is that we have some constraints that go a bit against how DVC seems to be designed. Namely, we have a folder datasets on the NAS with large datasets, both in size and number of files. Multiple users need to be able to save directories/files to this folder, and then add/commit/push them with DVC. We cannot have data duplication either, and the directories/files in folder datasets need to remain there, for non-DVC users to read them.

So, our current solution is that the NAS is mounted on a server at /nas/, and looks like this:

nas
  ├──  dvc_cache
  └──  datasets
           ├── .dvc
           ├── .git
           ├── dataset_1
           └── dataset_2

with a configuration file (hardlinks/symlinks to avoid data duplication)

[cache]
dir = /nas/dvc_cache/
shared = group
type = "hardlink,symlink"
[core]
autostage = true

But an immediate problem is that if user A does dvc add dataset_1, and user B does dvc add dataset_2, then user A can git commit -a and git push and commit/push both datasets.

Another issue we are having is that when user_A adds a dataset, the directory created in the cache, e.g. /nas/dvc_cache/e0, ends up with both owner and group user_A, instead of owner user_A and group everybody. That blocks everybody else from any dvc operation that involves /nas/dvc_cache/e0.

My feeling is that each user should have their clone of the Data Registry on /nas, but I’m not sure this works with the Data Registry idea. Suggestions would be very welcome, thanks!

rcasero · August 10, 2022, 8:47am

[@shcheklein 's reply pasted here]

Hey, @rcasero . It’s a good scenario, let’s try to figure out something. I’m glad to see that you are using this advanced stuff in some interesting way.

My feeling is that each user should have their clone of the Data Registry on /nas

I think that goes in the right direction. I would even say it should be in their home directories. I don’t see any reason for keeping it on the NAS; moreover, git and DVC usually do not work well on a NAS. Your setup should be enough to git clone into a personal directory, check out a specific dataset (via symlinks in this case), add more files and do dvc add or dvc commit + git push.

We cannot have data duplication either, and the directories/files in folder datasets need to remain there, for non-DVC users to read them.

That’s an interesting requirement, and yes, you can keep it then and do git pull + dvc pull periodically, I think. An alternative for end users is to rely on dvc import or dvc get in their projects. This is the best way to get a dataset right into their workspace. Multiple users can even get different versions simultaneously without affecting each other. Curious about your thoughts on this workflow.

@rcasero after this discussion is done, we should probably update the discussion forum so that other folks can benefit …

rcasero · August 10, 2022, 2:16pm

@shcheklein Thanks for the suggestions. I have some feedback, and a minor clarification. The clarification is that many datasets in /nas/datasets won’t be under DVC. It’s only when we need to process a dataset, that we put it under DVC. At that point, we expect that DVC will make that dataset read-only, and although it can still be read by non-DVC users, only DVC users can modify or delete it.

I would even say it should be in their home directories. I don’t see any reason for keeping it on the NAS; moreover, git and DVC usually do not work well on a NAS. Your setup should be enough to git clone into a personal directory, check out a specific dataset (via symlinks in this case), add more files and do dvc add or dvc commit + git push.

The reasons to keep the user’s Data Registry clones on the NAS in our case would be, in my opinion:

There’s not enough space in the home directories for our datasets. If the Data Registry clone is on server:/home/user_A/data_registry/, and the user wants to add new files or a new dataset, they first need to be copied to the home directory, but they won’t fit there.
Even if there were enough space, this would be a bit inefficient. First, a copy needs to be made to the server:/home/user_A/data_registry/ filesystem. Then, when the dataset is added to DVC, dvc needs to move the files to the DVC cache on the NAS filesystem, which takes time.

So, instead, to me it makes sense to have something like this

nas
├── data_registry_user_A
├── data_registry_user_B
├── data_registry_user_C
├── datasets
│   ├── dataset_1
│   └── dataset_2
└── dvc_cache

The workflow would be

Data is originally copied to the NAS in /nas/datasets.
If we need to process a dataset, it can be quickly moved to a user’s data registry, e.g. /nas/data_registry_user_A (mv /nas/datasets/dataset_1 /nas/data_registry_user_A), because both directories are on the same filesystem.
The user then does cd /nas/data_registry_user_A and dvc add dataset_1 and commits. This should create hardlinks in the cache to the data files. Because hardlinks point to the same inode on the filesystem, it avoids data duplication or moving data around.
The user then goes to the general registry (cd /nas/datasets) and does git pull + dvc pull as you suggest. This should create more hardlinks, again avoiding duplication, and also allowing non-DVC users to access the data files transparently. (Note: But we expect that dvc will make the files read-only).
Finally, to process the data, the user would create a git project on the server, server:/home/user_A/processing_of_dataset_1/, and then dvc import the dataset from the Data Registry, as you suggest. In this case, symlinks are created across filesystems, but that’s fast, and the data doesn’t take up space.
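To make the hardlink/symlink reasoning in these steps concrete, here is a minimal sketch that uses throwaway directories in place of the NAS and the home directory. The directory and file names are hypothetical stand-ins, and no DVC command is involved; it only shows the filesystem behaviour the workflow relies on.

```shell
#!/usr/bin/env bash
# Sketch: hardlinks vs. symlinks, using throwaway directories in place of the NAS.
# "nas" and "home" are hypothetical stand-ins, not the real mount points.
work=$(mktemp -d)
mkdir -p "$work/nas/dvc_cache" "$work/nas/datasets" "$work/home/project"
printf 'raw data\n' > "$work/nas/datasets/sample.bin"

# A hardlink is a second name for the same inode: no bytes are copied.
ln "$work/nas/datasets/sample.bin" "$work/nas/dvc_cache/sample.bin"

# A symlink only stores a path, so it also works across filesystems.
ln -s "$work/nas/dvc_cache/sample.bin" "$work/home/project/sample.bin"

# -ef is true when both names resolve to the same device and inode.
if [ "$work/nas/datasets/sample.bin" -ef "$work/nas/dvc_cache/sample.bin" ]; then
  same_inode=yes
else
  same_inode=no
fi
link_target=$(readlink "$work/home/project/sample.bin")
link_content=$(cat "$work/home/project/sample.bin")
echo "same inode: $same_inode"
echo "symlink target: $link_target"
rm -rf "$work"
```

Hardlinks only work within one filesystem, which is why the last step (importing into the home directory on the server) has to fall back to symlinks.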

Any thoughts? I’m going to give it a try, and report how it goes.

shcheklein · August 10, 2022, 6:20pm

@rcasero thanks for moving it here. I hope it’ll be useful for a lot of people.

Let me first try to brainstorm with you on this, and then we can discuss the workflow you suggested (which looks like it overlaps with what I’m about to write). If you see any discrepancies with what you have in mind, please let me know.

There’s not enough space in the home directories for our datasets.

Yep, but if the NAS is attached and the DVC cache (.dvc/cache by default) points to the NAS, then DVC won’t be creating copies. That’s the beauty of it. It will be manipulating symlinks, and the data still stays on the NAS.

Does it make sense? We can discuss, I guess, some specific commands to make that happen.

First, a copy needs to be made to the server:/home/user_A/data_registry/ filesystem.

Same here. DVC can just manipulate symlinks. So you first do git clone, then dvc pull dataset1. Assuming that DVC is set up correctly (cache points to NAS, symlinks are enabled), this operation should be relatively quick and should not be moving data anywhere. The user can then add files into dataset1 and do dvc commit + git push + dvc push (if you also use S3 for backup, for example).

Same, when they want to use any data - they can do dvc import to symlink a specific version of their dataset into their project. Again data is not copied if everything is done correctly.

The clarification is that many datasets in /nas/datasets won’t be under DVC

That’s fine. Again, on the NAS you can also keep a cloned version of the data registry, plus some actual datasets inside it that for now you can gitignore, for example.

One caveat with everything I’ve described so far is usually how to bootstrap this setup initially: how to get the whole dataset on the NAS into the DVC cache without copying it locally.

Please always have a backup the first time you are doing this.

There are two ways. The first is to do this on the NAS itself, like literally:

cd NAS/data
git init
dvc init
dvc config ... # set symlinks
dvc cache dir ... # set cache location
dvc add dataset1
...
git commit dataset1.dvc -m "first version of the first dataset"
git push # to some GH location

If for some reason Git or DVC do not work well on NAS fs (it happens quite often, since NAS fs can be quite specific). You could use example.

rcasero · August 11, 2022, 9:40am

Thanks @shcheklein . I think I have not explained the data workflow too well. So, there isn’t an initial setup on NAS that one person can do, and then users can git clone and dvc pull dataset1, and maybe add some files, etc.

New datasets will be regularly copied to the NAS drive, by other people, and each dataset will typically be large and with lots of files. So, it’s an ongoing situation, with more and more datasets getting added, and that’s mostly outside of our control. The data will be on the NAS, that’s the starting point.

What I’m trying to solve is a situation where multiple users need to now and then process one of those new datasets. So, I want the user to 1) put the dataset he/she needs to process under DVC, 2) import it into their software gitlab project (using DVC to link to the files on the NAS), 3) write their scripts to process the dataset, 4) keep both the dataset and code under version control.

At the same time, those datasets need to be readable by the non-DVC users who put them there (it’s OK if we make the ones under DVC read-only).

Note: I didn’t want to add more complexity to the question, but I actually also want to create new datasets with the processing outputs, and those also live on the NAS, side by side with the raw data.

You are correct that the processing will happen from a (linux) server that mounts the NAS with NFS (let’s assume to /nas/datasets).

As discussed above, our first attempt at a solution, following @pmrowla 's suggestions, was to create a Data Registry in /nas/datasets, to directly dvc add datasets as needed. But then we have clashes between users, because they are issuing git and dvc commands in the same directory.

I think we are both in agreement that 1) it’d be good to have each user have a clone of the data registry and that 2) the folders with the processing scripts and imported datasets should be on the server, 3) we have a shared cache on the NAS for DVC.

But I think that the folder with the processing scripts cannot be used to do the dvc add dataset1, because that would require copying the dataset itself to the server home directory (as discussed above, this adds an extra copying of data around, and there’s not enough space for that initial copy of the data that undergoes the dvc add anyway).

Maybe it’d simplify the discussion if we think about it this way: “The data lives on the NAS and cannot leave the NAS”?

Thus, as far as I can tell, the workflow needs to be broken down into two steps:

the user has a Data Registry clone on the NAS, and she/he can dvc add the data from there. Then, the user has to dvc pull in the common Data Registry, which is the common location where everybody can find all data files. Everything will be hardlinks, so data is not duplicated or actually moved.
the user also has a folder for writing processing scripts on the server, and she/he can dvc import the dataset there, which will happen with softlinks.

shcheklein · August 11, 2022, 5:19pm

I think we are on the same page with this. My point was that it can be done in a way that data always stays on NAS but users are doing git clone data-registry + dvc add, etc. in their own home directory. Sorry if that was / is still not clear from all the answers I gave before. It might be because it’s quite counterintuitive (or it might be that I underestimate or don’t know something; please correct me then).

the user also has a folder for writing processing scripts on the server, and she/he can dvc import the dataset there, which will happen with softlinks.

The idea behind updating the data registry is the same. It can be done with soft links. So, let’s imagine the workflow. Let’s say we have a new location on NAS, /nas/registry/new-dataset, and it’s not yet in the data registry. We want to add it (so that it also stays on the NAS, with no data copying happening, etc.). We have two options for that.

First is the one you do now (I think). Pretty much /nas/registry itself is a clone of a data registry. Users can go there, do:

dvc add new-dataset
git commit new-dataset.dvc -m "add new dataset"
git push
dvc push # if you have a remote storage for backup

Is your concern that multiple users might come simultaneously and try to do this? Can you btw describe some exact case when they collide with each other? I would assume this to be rare, to be honest.

Second option is to do this:

cd ~
git clone data-registry-repo
cd  data-registry-repo

# both options below can be set up in the repo config and shared if the mount point for NAS is stable
dvc cache dir /nas/registry/.dvc/cache
dvc config cache.type hardlink,symlink 

# this command doesn't copy data locally, at least it should not be doing that
dvc add /nas/registry/new-dataset -o new-dataset
git commit new-dataset.dvc -m "add new dataset"
git push
dvc push # if you need it

In the second option, you would still probably want to have /nas/registry as a Git repo, and you can do git pull + dvc checkout --relink new-dataset there to also make /nas/registry/new-dataset linked via hardlinks / symlinks.

Note: I didn’t want to add more complexity to the question, but I actually also want to create new datasets with the processing outputs, and those also live on the NAS, side by side with the raw data.

That’s interesting and indeed can complicate things. Do you expect all the users to be able to put those processed outputs there? Or some ETL process?

allenyl · August 13, 2022, 2:28am

Hi @rcasero , I think your problem is similar to ours, and I’m trying to propose a solution here:

github.com/iterative/dvc

Git tracked bare DVC repo (only tracking .dvc file, but don't checkout real file)

opened 05:49AM - 05 Aug 22 UTC

allenyllee

awaiting response discussion

# Background
We have a lot of daily generated log file, we want to use dvc to tracking our daily log.

## Current Method
If we want to use dvc to tracking our daily log, for now, we have to:
1. Create a git repo and `dvc init`
2. Copy the log files into git repo
3. `dvc add` those files and `dvc commit` to generate `.dvc` file, and then `dvc push` to transfer files to remote.
4. `git commit` the generated `.dvc` file, and `git tag` to add a time stamp(or version)
5. To save local space, remove all the log files, only leave `.dvc` files

When new daily logs coming, we need to repeat 2-5 step for tracking.

When someone need to analyse log files, they need to: Clone the git repo, `git checkout` a tagged version, and `dvc checkout` to download files to the local.

## Proposed Method
1. Provide a single dvc command (something like `dvc init --bare --remote` or a Python API) to create a **"git tracked bare dvc repo"** in remote machine
2. Provide a single dvc command (something like `dvc push --transfer --remote` or a Python API) to directly transfer daily log to the remote, this command has a `--tag` option, it will do the above 2-5 step in the remote machine.
3. When daily logs coming, just do step 2 to transfer files with version tag. (no need to copy into a local git repo)
4. When someone need to analyse log files, they can: Clone the **"git tracked bare dvc repo"** with only `.dvc` files, `git checkout` a tagged version, and `dvc checkout` to download files.

Further, because the **"git tracked bare dvc repo"** should only be modified by the data owner, someone can not push their code to the **"git tracked bare dvc repo"** remote. Instead, they created a new git repo, and add **"git tracked bare dvc repo"** as a another git remote. In the git graph, they can see two parallel line: one for our data repo, one for their code repo.

They can cherry-pick a commit from data repo, move `.dvc` file into other folder, then do `dvc checkout`, the file will pull from our data repo, downloaded into their folder, then they can start writing their code, commit to their git remote.

## Sum up
The **"git tracked bare dvc repo"** we can treat it as a combination of `git bare repo` and `dvc cache`, it's a whole structure only for tracking data blob. It can see as a regular git remote, import as a `git submodule`, but can only modified by data owner. For the developer, they just include it, pull the data, do their experiments, push to their own repo without touching the data repo. Also, If you don't use git, you can still treat it as a regular dvc cache remote. But with git, you have full power of git!

## Advance
If you have multiple data source and want to share a single data repo, one can provide `--source` option in proposed step 2, then the command will create a git branch with provided source name. This newly created branch is parallel to other source branch (with no common commit). From developer's view, they can see many parallel branch resides in data repo, and they just need to pick a branch (a data source) to merge into their local working branch. In case the data owner needs to merge two data source into one, it can be as easy as using `git merge` in the data repo, to merge two parallel data source branch into one branch!

You can join our discussion if you like the proposed idea.

rcasero · August 15, 2022, 8:41am

I like the idea, I’m looking into it, @shcheklein

rcasero · August 21, 2022, 11:27am

Hi @shcheklein ,

I went over your suggestion this last week. Sorry, it takes a bit of time to test these things internally, one has to be careful. I liked it very much, as I said.

This happened as soon as we had a group session to teach dvc. Imagine that you have a directory with the Data Registry, /nas/data_registry/ (let’s assume it’s configured with autostage, so we don’t need to be doing git add)

# user A
cd /nas/data_registry/
dvc add dataset_foo

# user B
cd /nas/data_registry/
dvc add dataset_bar

# user A
git commit -m "adding dataset_foo"

At this point, user A has also committed dataset_bar. And user B will have no idea of what happened.
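The collision above can be reproduced, and avoided, with plain git in a scratch repository. This is only a sketch: a plain git add stands in for what dvc add does when autostage is enabled, and the .dvc file contents are placeholders, not real DVC metadata. It shows that naming the path on the commit line limits the commit to that file.

```shell
#!/usr/bin/env bash
# Sketch: two staged files in one shared working tree, one path-limited commit.
# File names mimic the example above; "git add" stands in for autostage.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name "user A"
git config user.email "user_a@example.com"
git commit -q --allow-empty -m "initial commit"

echo "placeholder for dataset_foo" > dataset_foo.dvc   # staged by user A
echo "placeholder for dataset_bar" > dataset_bar.dvc   # staged by user B
git add dataset_foo.dvc dataset_bar.dvc

# Naming the path commits only that file and leaves the other one staged.
git commit -q -m "adding dataset_foo" -- dataset_foo.dvc

committed=$(git diff-tree --no-commit-id --name-only -r HEAD)
still_staged=$(git diff --cached --name-only)
echo "committed:    $committed"
echo "still staged: $still_staged"
cd / && rm -rf "$repo"
```

A path-limited commit keeps user B's file out of user A's commit, but both users still share one index and one working tree, so it does not remove the underlying risk.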

In addition, there are worse issues, although some of those would probably happen with data registry clones as long as there’s a shared cache. But if we are sharing the same directory, then one user can more easily break it for everyone.

If user A is misconfigured to create files by default with userA group, instead of with a common group to all users, suddenly, there are directories/files in the cache that can only be written by that user. This happens even if dvc is configured with the shared = group option. The other users start getting weird “no permission” errors.
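One generic Unix-side mitigation for this kind of ownership problem is a setgid bit on the shared directory plus a group-friendly umask for each user. The sketch below is an assumption about how that could look, not something DVC does for you; it runs against a temporary directory standing in for /nas/dvc_cache, and on the real NAS the directory would first need to be assigned to the common group (whose name is not given in this thread).

```shell
#!/usr/bin/env bash
# Sketch: keep a shared directory group-writable for everyone.
# The directory is a temporary stand-in for /nas/dvc_cache.
cache=$(mktemp -d)

# setgid (the leading 2) makes new entries inherit the directory's group
# instead of the creating user's primary group.
chmod 2775 "$cache"

# umask 002 keeps the group write bit on whatever this shell creates.
umask 002
mkdir "$cache/e0"
touch "$cache/e0/object"

# Keep only the ten mode characters from the long listing.
dir_perms=$(ls -ld "$cache/e0" | cut -c1-10)
file_perms=$(ls -l "$cache/e0/object" | cut -c1-10)
echo "e0:     $dir_perms"
echo "object: $file_perms"
rm -rf "$cache"
```

Whether the NFS export honours setgid and the client-side umask depends on how the NAS is configured, so this is worth testing on a scratch directory first.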

Compartmentalising what each user can do in their own data registry clone should prevent at least the first type of problems.

This is the option I’ve tried. It seems to work for me, although I still have to fix some of those datasets. It’s a bit tricky, in the sense that each user needs to remember quite a few steps. Also, it’s fragile in that if you dvc add foo_dataset and then again dvc add foo_dataset, that creates a foo_dataset/foo_dataset. So, in the end, I’ve written a script so that users have a single simple command to add a dataset with our configuration. The relevant part, based on your suggestions, is

  # add the dataset to the Data Registry with DVC. Note that we use option "-o" to avoid making
  # a local copy of the data in this Data Registry clone. Instead, this creates a local directory
  # with symlinks to the actual data files in the common Data Registry
  echo_if_verbose "Putting dataset under DVC"
  dvc add "${dataset_path}" -o "${dataset}"

  # push the changes to the Data Registry project in the Gitlab server. 
  # Note that DVC is configured with autostage, so we don't need to "git add" the new dataset.dvc file
  echo_if_verbose "Pushing changes to Data Registry project in Gitlab server"
  git commit "${dataset}".dvc .dvcignore -m "add dataset ${dataset}"
  git push

  # pull the new changes in the common Data Registry
  echo_if_verbose "Pulling new changes into common Data Registry: $datasets_dir"
  pushd "$datasets_dir"
  git pull

  # relink the dataset files to the cache using hardlinks, so that data is not duplicated
  echo_if_verbose "Relinking dataset files to the DVC cache using hardlinks"
  dvc checkout --relink "${dataset}".dvc
  popd

  echo_if_verbose "Dataset successfully added to the Data Registry"
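For the fragility mentioned above (running dvc add twice on the same dataset), a wrapper like this could check for an existing .dvc file before doing anything. The following is a hedged sketch, not part of the actual script: the function name dataset_already_tracked and its arguments are hypothetical, and only the "<dataset>.dvc" naming is taken from the script.

```shell
#!/usr/bin/env bash
# Sketch of a pre-check for the wrapper script. The function name and its
# arguments are hypothetical; only the "<dataset>.dvc" naming comes from
# the script above.
dataset_already_tracked() {
  local registry_dir=$1 dataset=$2
  [ -e "${registry_dir}/${dataset}.dvc" ]
}

# Example run against a temporary directory standing in for a registry clone.
registry=$(mktemp -d)
touch "${registry}/foo_dataset.dvc"

if dataset_already_tracked "$registry" foo_dataset; then
  first_check="skip"    # a .dvc file exists, so do not run "dvc add" again
else
  first_check="add"
fi
if dataset_already_tracked "$registry" bar_dataset; then
  second_check="skip"
else
  second_check="add"
fi
echo "foo_dataset: $first_check, bar_dataset: $second_check"
rm -rf "$registry"
```

In the wrapper, the check would run before the dvc add line and abort with a message when the dataset is already tracked.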

This is something that I’m still getting my head around. I think I’ll open another thread once I have some positive or negative results with what I’m going to try.
