Single cache or multiple caches in NAS with External Data

We have a large NAS with large datasets both in number of files and file size. Those data sets need to be used by several gitlab projects (data wrangling, data processing, modelling, making plots, etc).

We want to have some kind of version control of the data (mainly to be able to say “this experiment was run with these data” and to check that the data haven’t changed since), so I have made some simple tests with DVC, but I’d like the opinion of people with more insight. My main question is whether I should set up a single shared cache for all projects, or independent caches for each project (and whether my proposed solution is sound at all). A secondary question is whether the setup will allow us to run experiments in the cloud down the line. Here are the details of the problem and my attempted solution:

Main constraints:

  • we cannot migrate the data to a flat directory structure where files get renamed with their hash, so the data has to co-exist with the cache. That is, we need to maintain the original directory structure and filenames in the NAS as they are
  • the NAS data needs to be accessed and processed by several people from several machines (the NAS is an ext2 filesystem that gets mounted with NFS on those machines), typically by cloning the gitlab project to their own /home/john_doe/Software/my_cats_projects local directory.
  • we cannot have duplicates of the data, e.g. the one on the NAS and then on each machine that needs to process it

The solution I have come up with, and that seems to work is to combine DVC’s “Managing External Data” (https://dvc.org/doc/user-guide/managing-external-data) and “Large Dataset Optimization” (https://dvc.org/doc/user-guide/large-dataset-optimization).

We can have gitlab projects with a softlink to the external data in the NAS, e.g.

my_cats_project
├── bin
│   └── process_some_data.sh
├── data
│   └── some_small_local_data.csv
├── external_data → /nas_server/laboratory_1/big_dataset_with_cats/
│   ├── cat_movie_001.mov
│   └── cat_movie_002.mov
├── README.md
└── src
    └── python_code.py
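
(For completeness, the softlink itself is just a plain symlink; the paths below are the example ones from the tree above:)

# from the root of the cloned gitlab project, link the NAS dataset into the project
$ ln -s /nas_server/laboratory_1/big_dataset_with_cats/ external_data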

We initialise dvc in the gitlab project

$ cd my_cats_project
$ dvc init

We then tell dvc to use an external cache on the NAS. This is the main part of my question: is it better to use a single external cache living on the NAS for all datasets and projects,

$ dvc cache dir /nas_server/common_dvc_cache

or have separate caches for each project,

$ dvc cache dir /nas_server/projects_dvc_cache/my_cats_project_dvc_cache

Either way, the cache or caches will live on the NAS, avoiding transferring large data files to /home directories.

To avoid data duplication, we configure the cache to use hardlinks (as reflinks are not available on ext2)

$ dvc config cache.type hardlink

This way, the original data files (cat_movie_001.mov) and their cache “copies” (asd9890w908ad9sfasd9fasdf980asfd) will point to the same inode.
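
(A quick way to check that the hardlink was actually created is to compare inode numbers; the cache path below is only illustrative, and the exact cache layout depends on the DVC version:)

# both entries should report the same inode number
$ ls -i external_data/cat_movie_001.mov
$ ls -i /nas_server/common_dvc_cache/as/d9890w908ad9sfasd9fasdf980asfd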

In this setup, data cannot be overwritten or updated in place. For example, if we want new versions of the cat movies above, we’ll have to create new directories for those, but that’s fine in this case.

DVC will make the original files non-writable to protect the cache from being corrupted. But we can still move the original data files on the NAS using “dvc move”, right?
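
(To illustrate what I mean, assuming the movies were added individually with dvc add, something like the following, with made-up paths:)

# rename/move a tracked file; dvc move also updates the corresponding .dvc file
$ dvc move external_data/cat_movie_001.mov external_data/archive/cat_movie_001.mov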

Finally, I’d be grateful for any insights on whether this setup would then allow us to run projects on cloud service providers supported by DVC.

Thanks a lot!

Ramon.

  • we cannot migrate the data to a flat directory structure where files get renamed with their hash, so the data has to co-exist with the cache. That is, we need to maintain the original directory structure and filenames in the NAS as they are

Can you elaborate on this requirement?

Using external data is generally not recommended and is only intended as a workaround for very specific use cases (such as where data absolutely cannot be moved/reorganized due to security or compliance reasons).

The way that DVC works is that the “original directory structure and filenames” are kept in a Git/DVC repository (where DVC keeps track of the mapping between the original structure and the actual file hashes).
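
To make that mapping concrete: when you dvc add a directory, DVC writes a small .dvc stub into the Git repo that records the original path and the content hash (the hash and sizes below are made up):

$ cat big_dataset_with_cats.dvc
outs:
- md5: 3f4e2a1b9c8d7e6f5a4b3c2d1e0f9a8b.dir
  size: 123456789
  nfiles: 2
  path: big_dataset_with_cats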

Finally, I’d be grateful for any insights on whether this setup would allow to then run projects on cloud service providers supported by DVC.

Can you clarify what you mean by “run projects on cloud service providers” here?

If you are asking about DVC remotes, the answer is no: data tracked with the external data feature cannot be pushed/pulled to/from DVC cloud remotes.


I think the typical solution for your problem would be to set up a main/central DVC repository as a data registry. The data registry repo would replace your existing setup (so that the data registry repo would follow the “original directory structure and filenames in the NAS”, and it would handle keeping track of the mapping from the original structure to the DVC content-addressable file hash storage structure).

Then, all of your other projects would import the data from the data registry as needed.

With regard to keeping data only on the NAS, all of your projects (including the data registry itself) could be configured to use a shared DVC cache which is kept on the NAS (as described in the large dataset optimization docs).

So the end result here would be:

  • All actual data is only stored on the NAS (in the DVC cache structure), and only a single copy of each file exists on the NAS
  • The original file/directory structure is managed/tracked by the data registry project.
  • With regard to managing the data registry repo itself, a “local” copy of that repo could be kept anywhere, whether that is on the NAS itself (and configured to use hardlinks) or elsewhere (and configured to use symlinks)
  • For each user’s individual projects, they would have their own local clone of the gitlab project configured to use symlinks (user’s local machines would not store any of the actual data, they would only have symlinks pointing to the NAS)

This setup would also allow you to push/pull data from the shared cache on the NAS to DVC cloud remotes.
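
For example (the bucket name and path below are just placeholders), the standard remote workflow would then apply:

# add a default S3 remote and push the cached data to it
dvc remote add -d s3remote s3://your-bucket/dvc-storage
dvc push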

Hi pmrowla,

Thanks for your insightful reply, and sorry it has taken me a while to implement the ideas. I think I have it working now (details below, in case you’d like to check).

Can you elaborate on this requirement?

What I meant by “we cannot migrate the data to a flat directory structure where files get renamed with their hash” is that the data files still need to exist and be accessible to users who have no git/dvc knowledge and will use Mac Finder / Windows File Explorer / Linux Nautilus to copy and browse files on the NAS without using DVC. Your solution enables this, so that’s great.

We then want to be able to put some of those data sets under version control. Your solution of the Data Registry allows that too.

I configured the Data Registry with

dvc cache dir ../dvc_cache/
dvc config cache.shared group
dvc config cache.type hardlink

so I expected that directories added with dvc add would be write-protected, as the documentation says. However, I can still see them as writable:

drwxr-xr-x. 4 user group 4.0K May 24 10:09 foo_dataset
-rw-r--r--. 1 user group   93 May 25 14:47 foo_dataset.dvc
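
(The listing above is of the directory entry itself; to check the individual files inside the tracked directory, something like the following could be used:)

# check permissions of the files inside the tracked directory, not the directory entry
ls -l foo_dataset/donor1/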

Can you clarify what you mean by “run projects on cloud service providers” here?

What I meant, now updated with your Data Registry suggestion, is this: we have our local NAS with the datasets and the Data Registry, and we have software experiments on another local GPU server (e.g. a Python script that processes a dataset). We put that dataset under version control in the Data Registry following your advice, and we can then import it into the experiment with dvc import to benefit from data version control.

But, after doing the prep work locally, we want to run this on AWS. The NAS already gets copied regularly to AWS anyway, so the data will already be there; we want to avoid having to copy it to S3 buckets every time we run new experiments.

I need to think a bit more about this after having implemented your Data Registry proposal (I’ll probably come back with more questions!)

Details about our current implementation, after your advice:

The NAS has a datasets directory, with subdirectories that contain data sets with data files, e.g.

datasets/
└── foo_dataset
    ├── donor1
    │   ├── file1.jpg
    │   ├── file2.jpg
    │   └── file3.jpg
    └── donor2
        ├── file4.jpg
        ├── file5.jpg
        └── file6.jpg

I created a Gitlab repository, data_registry, and pulled it into the existing datasets directory.

I created a directory dvc_cache for the shared cache, at the same level as datasets.

I then configured the DVC cache in datasets

cd datasets/
dvc cache dir ../dvc_cache/
dvc config cache.shared group
dvc config cache.type hardlink
dvc config core.autostage true
git commit .dvc/config -m "Configure external shared cache in ../dvc_cache"
git push

Now we can put selected data sets under version control with

cd datasets/
dvc add foo_dataset
git commit foo_dataset.dvc .gitignore -m "Put foo_dataset under data version control"
git push

To import a data set into an experiment that lives on another server (one that mounts the NAS on /mnt/nas), first we configure DVC for the experiment

cd ~/Software/myexperiment
# initialise DVC
dvc init
# tell the experiment to use the Data Registry shared cache as its own DVC cache
dvc cache dir /mnt/nas/dvc_cache/
dvc config cache.shared group
# if the experiment directory is in the same filesystem as the Data Registry (the Data 
# Registry lives on NAS2), then DVC will create hardlinks to the data files. If they are 
# in different filesystems (e.g. if the experiment is on NAS1), it will create softlinks
dvc config cache.type hardlink,symlink
# make "dvc add" automatically do "git add" of the files it creates/modifies/deletes. This
# stages those files so they are ready to commit
dvc config core.autostage true
# commit the changes
git commit .dvc .dvcignore .gitignore -m "Configure DVC for experiment"

then, import the data set into the experiment

dvc import GITLAB_SERVER_URL/data_registry foo_dataset
git commit .gitignore foo_dataset.dvc -m "Import dataset foo_dataset"
git push

This creates a directory foo_dataset/ with the data set files and a pointer file foo_dataset.dvc.
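
If the data set later changes in the Data Registry, my understanding is that the import can be refreshed with dvc update (I still have to test this part):

# bring the imported data set up to date with the Data Registry
dvc update foo_dataset.dvc
git commit foo_dataset.dvc -m "Update foo_dataset from the Data Registry"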