Could you give a little bit more context, please:
- How long was it taking before you interrupted the process?
- How do you mount your NAS storage? NFS, something else?
- How does the code that processes this data access it now? Does it use the
/mnt/dataset/project1_data/ path directly, or symlinks into the project's directory?
It looks like in your case it would be beneficial to keep the DVC cache (the place where all versions of all datasets, models, etc. are stored) on your NAS. That way you avoid copying large files from your NAS to the machine (let me know if you actually want to have a copy of your data in your project's workspace).
dvc cache dir /mnt/dataset/storage
dvc config cache.type "reflink,symlink,hardlink,copy" # to enable symlinks to avoid copying
dvc config cache.protected true # to make links read-only so that you don't corrupt them accidentally
git add .dvc .gitignore
git commit . -m "initialize DVC"
Now, to add a first version of the dataset into the DVC cache (this is done once for a dataset), I would do the following:
cd /home/user/project1/
mv /mnt/dataset/project1_data/ data/
dvc add data # it should be way faster now
git add data.dvc .gitignore
git commit . -m "add first version of the dataset"
git tag -a "v1.0" -m "dataset v1.0"
git push origin HEAD
git push origin v1.0
Next, in your project, you can do something like this:
# you should see the data.dvc file now
# you should see the data directory now, which should be a symlink to the NAS storage
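If you want to double-check that the workspace really points at the NAS rather than holding a copy, here is a rough sketch. The two commands in the comments are my assumption about the steps this part of the post refers to. `links_into_cache` is a made-up helper name, not a DVC command; it only uses `readlink -f`. Depending on your DVC version, the links may be the individual files inside data/ rather than the directory itself, so try it on a file as well.

```shell
# Assumed steps, not shown above (run inside the project directory):
#   git pull        # should bring in data.dvc
#   dvc checkout    # should recreate data/ as links into the DVC cache

# Hypothetical helper: succeeds if the given path resolves, after following
# symlinks, to a location inside the given cache directory.
links_into_cache() {
    path=$1
    cache_dir=$2
    resolved=$(readlink -f "$path") || return 1
    cache_real=$(readlink -f "$cache_dir") || return 1
    case "$resolved" in
        "$cache_real"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example call (the file name is a placeholder, the cache path is from this thread):
#   links_into_cache data/some_file /mnt/dataset/storage && echo "linked, not copied"
```

A non-zero exit status means the path is a regular copy (or points somewhere else), in which case the cache.type setting is worth re-checking.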