Setup DVC to work with shared data on NAS server

Hi @jimmy15923!

Could you give a little bit more context, please:

  1. How long was it taking before you interrupted the process?
  2. How do you mount your NAS storage? NFS, something else?
  3. How does the code that processes this data currently access it? Does it use the /mnt/dataset/project1_data/ path directly, or do you use symlinks to the projects directory?
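
For question 2, one quick way to check how the NAS is mounted is to ask for the filesystem type behind the mount point. This is only a minimal sketch, assuming GNU coreutils `stat` on Linux; `fs_type` is a hypothetical helper name, not a DVC command:

```shell
# fs_type is a hypothetical helper, not part of DVC.
# Prints the type of the filesystem that holds the given path
# (for example "nfs" for an NFS mount). Assumes GNU coreutils stat.
fs_type() {
    stat -f -c %T "$1"
}

# Example: fs_type /mnt/dataset
```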

It looks like in your case it would be beneficial to keep the DVC cache (the place where all versions of all datasets, models, etc. are stored) on your NAS. That way you avoid copying large files from your NAS to the machine (let me know if you actually want a copy of the data in your project’s workspace).

cd /home/user/project1/
dvc init
dvc cache dir /mnt/dataset/storage
dvc config cache.type "reflink,symlink,hardlink,copy" # to enable symlinks to avoid copying
dvc config cache.protected true # to make links read-only so that you don't corrupt them accidentally
git add .dvc .gitignore
git commit . -m "initialize DVC"
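
A note on why the symlink entry matters in that list: hardlinks cannot cross filesystem boundaries, and reflinks also need both paths on the same filesystem, so with the workspace on a local disk and the cache on NAS, symlink is most likely the first link type that actually works. A rough sketch to check this, assuming GNU coreutils `stat`; `same_filesystem` is a hypothetical helper, not a DVC command:

```shell
# same_filesystem is a hypothetical helper, not part of DVC.
# Succeeds (exit status 0) when both paths are on the same device,
# which is a precondition for hardlinks. Assumes GNU coreutils stat.
same_filesystem() {
    [ "$(stat -c %d "$1")" = "$(stat -c %d "$2")" ]
}

# Example:
# same_filesystem /home/user/project1 /mnt/dataset/storage || echo "different filesystems"
```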

Now, to add a first version of the dataset into the DVC cache (this is done once for a dataset), I would do the following:

cd /mnt/dataset
cp -r /home/user/project1 . # copy the project (with its .dvc and .git) next to the data
cd project1
mv /mnt/dataset/project1_data/ data/
dvc add data # it should be way faster now
git add data.dvc .gitignore
git commit . -m "add first version of the dataset"
git tag -a "v1.0" -m "dataset v1.0"
git push origin HEAD
git push origin v1.0

Next, in your project, you can do something like this:

cd /home/user/project1/
git pull
# you should see data.dvc file now
dvc checkout
# you should now see the data directory, linked via symlinks to the NAS storage
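
To confirm that the checkout linked the data instead of copying it, you can look for symlinks under the data directory. A small sketch, assuming standard `find`, `wc` and `tr`; `count_symlinks` is a hypothetical helper, not a DVC command:

```shell
# count_symlinks is a hypothetical helper, not part of DVC.
# Prints how many symbolic links exist at or under the given path.
count_symlinks() {
    find "$1" -type l | wc -l | tr -d ' '
}

# Example: count_symlinks data
```

A non-zero count, together with `ls -l` showing link targets under /mnt/dataset/storage, suggests the links are in place.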