Single cache or multiple caches in NAS with External Data

We have a large NAS with large datasets both in number of files and file size. Those data sets need to be used by several gitlab projects (data wrangling, data processing, modelling, making plots, etc).

We want to have some kind of version control of the data (mainly, to be able to say “this experiment was run with these data” and to check that it hasn’t changed), so I have made some simple tests with DVC, but I’d like the opinion of people with more insight. My main question is whether I should set up a single shared cache for all projects, or independent caches for each project (and whether my proposed solution is sound at all). A secondary question is whether the setup will allow us to run experiments on the cloud down the line. Here are the details of the problem and my attempted solutions:

Main constraints:

  • we cannot migrate the data to a flat directory structure where files get renamed with their hash, so the data has to co-exist with the cache. That is, we need to maintain the original directory structure and filenames in the NAS as they are
  • the NAS data needs to be accessed and processed by several people from several machines (the NAS is an ext2 filesystem that gets mounted with NFS on those machines), typically by cloning the gitlab project into their own local directory, e.g. /home/john_doe/Software/my_cats_projects
  • we cannot have duplicates of the data, e.g. the one on the NAS and then on each machine that needs to process it

The solution I have come up with, and which seems to work, is to combine DVC’s “Managing External Data” (https://dvc.org/doc/user-guide/managing-external-data) and “Large Dataset Optimization” (https://dvc.org/doc/user-guide/large-dataset-optimization).

We can have gitlab projects with a softlink to the external data in the NAS, e.g.

my_cats_project
├── bin
│   └── process_some_data.sh
├── data
│   └── some_small_local_data.csv
├── external_data → /nas_server/laboratory_1/big_dataset_with_cats/
│   ├── cat_movie_001.mov
│   └── cat_movie_002.mov
├── README.md
└── src
    └── python_code.py
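
To make this concrete, assuming the NAS path shown in the tree above, the symlink is just created by hand, e.g.

$ cd my_cats_project
$ ln -s /nas_server/laboratory_1/big_dataset_with_cats/ external_data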

We initialise dvc in the gitlab project

$ cd my_cats_project
$ dvc init
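
If I understand correctly, dvc init stages the files it creates (.dvc/config, .dvc/.gitignore, .dvcignore), so we just commit them:

$ git commit -m "Initialize DVC"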

We then tell dvc to use an external cache on the NAS. This is the main part of my question: whether it’s better to use a single external cache living on the NAS for all datasets and projects,

$ dvc cache dir /nas_server/common_dvc_cache

or have separate caches for each project,

$ dvc cache dir /nas_server/projects_dvc_cache/my_cats_project_dvc_cache

Either way, the cache or caches will live on the NAS, avoiding transferring large data files to /home directories.

To avoid data duplication, we configure the cache to use hardlinks (as reflinks are not available on ext2)

$ dvc config cache.type hardlink
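
After these two commands, the project’s .dvc/config should contain something like this (shown here for the single shared cache variant):

[cache]
    dir = /nas_server/common_dvc_cache
    type = hardlink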

This way, the original data files (cat_movie_001.mov) and their cache “copies” (asd9890w908ad9sfasd9fasdf980asfd) will point to the same inode.
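
As a quick sanity check (the inode number below is made up), once a file is tracked (e.g. with dvc add) its hard link count should go from 1 to 2, since the cache entry is the same inode:

$ stat -c '%h %i %n' external_data/cat_movie_001.mov
2 1234567 external_data/cat_movie_001.mov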

In this setup, data cannot be overwritten or updated. For example, if we want new versions of the cat movies above, we’ll have to create new directories for those, but that’s fine in this case.

DVC will make the original files non-writable to protect the cache from being corrupted. But we can still move the original data files on the NAS using “dvc move”, right?
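
For example (the destination path is purely illustrative), I mean something like:

$ dvc move external_data/cat_movie_001.mov external_data/archive/cat_movie_001.mov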

Finally, I’d be grateful for any insights on whether this setup would then allow us to run projects on cloud service providers supported by DVC.

Thanks a lot!

Ramon.

  • we cannot migrate the data to a flat directory structure where files get renamed with their hash, so the data has to co-exist with the cache. That is, we need to maintain the original directory structure and filenames in the NAS as they are

Can you elaborate on this requirement?

Using external data is generally not recommended and is only intended as a workaround for very specific use cases (such as where data absolutely cannot be moved/reorganized due to security or compliance reasons).

The way that DVC works is that the “original directory structure and filenames” are kept in a Git/DVC repository (where DVC keeps track of the mapping between the original structure and the actual file hashes).
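
For illustration, each tracked file or directory gets a small .dvc file committed to Git that records that mapping, roughly like this (hash and size are made up):

# cat_movie_001.mov.dvc
outs:
- md5: 3f786850e387550fdab836ed7e6dc881
  size: 104857600
  path: cat_movie_001.mov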

Finally, I’d be grateful for any insights on whether this setup would then allow us to run projects on cloud service providers supported by DVC.

Can you clarify what you mean by “run projects on cloud service providers” here?

If you are asking about DVC remotes, the answer is no: if you are using the external data feature, the data cannot be pushed/pulled to/from DVC cloud remotes.


I think the typical solution for your problem would be to set up a main/central DVC repository as a data registry. The data registry repo would replace your existing setup: it would preserve the “original directory structure and filenames in the NAS” and keep track of the mapping from that structure to DVC’s content-addressable hash storage.

Then, all of your other projects would import the data from the data registry as needed.
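
For example (the repository URL and paths are hypothetical):

$ dvc import git@gitlab.example.com:lab/data-registry.git big_dataset_with_cats -o external_data/big_dataset_with_cats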

With regard to keeping data only on the NAS, all of your projects (including the data registry itself) could be configured to use a shared DVC cache which is kept on the NAS (as described in the large dataset optimization docs).
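
A sketch of that shared cache configuration, reusing the cache path from your post (cache.shared group makes cache files group-writable so several users can share them):

$ dvc cache dir /nas_server/common_dvc_cache
$ dvc config cache.shared group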

So the end result here would be:

  • All actual data is only stored on the NAS (in the DVC cache structure), and only a single copy of each file exists on the NAS
  • The original file/directory structure is managed/tracked by the data registry project.
    • With regard to managing the data registry repo itself, a “local” copy of that repo could be kept anywhere, whether that is on the NAS itself (and configured to use hardlinks) or elsewhere (and configured to use symlinks)
  • For each user’s individual projects, they would have their own local clone of the gitlab project configured to use symlinks (users’ local machines would not store any of the actual data, they would only have symlinks pointing to the NAS)
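
A sketch of what each user’s clone could look like (the repository URL is hypothetical; the cache location is inherited from .dvc/config if it is committed there):

$ git clone git@gitlab.example.com:lab/my_cats_project.git
$ cd my_cats_project
$ dvc config --local cache.type symlink
$ dvc checkout    # creates symlinks into the NAS cache instead of copying files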

This setup would also allow you to push/pull data from the shared cache on the NAS to DVC cloud remotes.
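
For example, with a hypothetical S3 bucket as the remote:

$ dvc remote add -d storage s3://my-bucket/dvc-storage
$ dvc push    # uploads objects from the shared NAS cache to the remote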
