We have a large NAS with datasets that are large both in number of files and in file size. Those datasets need to be used by several gitlab projects (data wrangling, data processing, modelling, making plots, etc.).
We want to have some kind of version control of the data (mainly so we can say “this experiment was run with these data” and check that they haven’t changed), so I have made some simple tests with DVC, but I’d like the opinion of people with more insight. My main question is whether I should set up a single shared cache for all projects or independent caches for each project (and whether my proposed solution is sound at all). A secondary question is whether the setup will allow us to run experiments in the cloud down the line. Here are the details of the problem and my attempted solution:
Main constraints:
- we cannot migrate the data to a flat directory structure where files get renamed with their hash, so the data has to co-exist with the cache. That is, we need to keep the original directory structure and filenames on the NAS as they are
- the NAS data needs to be accessed and processed by several people from several machines (the NAS is an ext2 filesystem that gets mounted with NFS on those machines), typically by cloning the gitlab project to their own /home/john_doe/Software/my_cats_projects local directory.
- we cannot have duplicates of the data, e.g. one copy on the NAS and another on each machine that needs to process it
The solution I have come up with, and that seems to work is to combine DVC’s “Managing External Data” (https://dvc.org/doc/user-guide/managing-external-data) and “Large Dataset Optimization” (https://dvc.org/doc/user-guide/large-dataset-optimization).
We can have gitlab projects with a softlink to the external data on the NAS (the link itself is just an ln -s, sketched after the tree below), e.g.
my_cats_project
├── bin
│   └── process_some_data.sh
├── data
│   └── some_small_local_data.csv
├── external_data → /nas_server/laboratory_1/big_dataset_with_cats/
│   ├── cat_movie_001.mov
│   └── cat_movie_002.mov
├── README.md
└── src
    └── python_code.py
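For concreteness, the softlink is just a plain ln -s from the project root (the NAS path is the example one from above):
$ cd my_cats_project
$ ln -s /nas_server/laboratory_1/big_dataset_with_cats/ external_data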
We initialise dvc in the gitlab project
$ cd my_cats_project
$ dvc init
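and commit the scaffolding that dvc init generates (as far as I can tell it stages .dvc/ and .dvcignore automatically):
$ git commit -m "Initialise DVC"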
We then tell dvc to use an external cache on the NAS. This is the main part of my question: whether it’s better to use a single external cache living on the NAS for all datasets and projects,
$ dvc cache dir /nas_server/common_dvc_cache
or have separate caches for each project,
$ dvc cache dir /nas_server/projects_dvc_cache/my_cats_project_dvc_cache
Either way, the cache or caches will live on the NAS, so we avoid transferring large data files to /home directories.
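For reference, dvc cache dir writes the chosen path into the project’s .dvc/config, which gets committed, so everybody who clones the project points at the same NAS cache. With the shared-cache option it ends up looking roughly like this on my test setup:
$ cat .dvc/config
[cache]
    dir = /nas_server/common_dvc_cache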
To avoid data duplication, we configure the cache to use hardlinks (as reflinks are not available on ext2)
$ dvc config cache.type hardlink
This way, the original data files (cat_movie_001.mov) and their cache “copies” (asd9890w908ad9sfasd9fasdf980asfd) will point to the same inode.
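To convince myself that no second copy is created, I track one movie and check that the workspace file and its cache entry share an inode (I’m not sure whether dvc add prefers the symlinked path or the absolute NAS path, and the cache layout differs between DVC versions, so this is just my sanity check):
$ dvc add external_data/cat_movie_001.mov
$ ls -i external_data/cat_movie_001.mov
$ find /nas_server/common_dvc_cache -inum <inode printed above>
The find command should print exactly one file inside the cache, i.e. the same inode under its hash name.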
In this setup, data cannot be overwritten or updated in place. For example, if we want new versions of the cat movies above, we’ll have to create new directories for them, but that’s fine in our case.
DVC will make the original files non-writable to protect the cache from being corrupted. But we can still move the original data files on the NAS using “dvc move”, right?
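What I mean is something like the following, assuming each movie was added individually with dvc add; I’d expect it to move the file on the NAS and update the corresponding .dvc file:
$ dvc move external_data/cat_movie_001.mov external_data/old_movies/cat_movie_001.mov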
Finally, I’d be grateful for any insights on whether this setup would allow us to later run projects on cloud service providers supported by DVC.
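To make that question concrete, what I imagine is adding a DVC remote on the provider and pushing the cache there (the S3 remote and bucket name are just placeholders for whatever provider we would pick):
$ dvc remote add -d cloud_storage s3://some-bucket/dvc-storage
$ dvc push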
Thanks a lot!
Ramon.