Set up DVC to work with shared data on a NAS server

Hi DVC team!

Thanks for providing these wonderful tools. Our scenario is that we have a server with NAS storage, and many projects need to use data stored on the NAS. The storage is updated regularly, so the size of the data grows every day. The data is stored in

/mnt/dataset/project1_data/

And the model code is in

/home/user/project1/

The project1_data directory is about 100 GB, so when I ran dvc add /mnt/dataset/project1_data/, it took so long that I interrupted it before it finished.

My question is: how do we use DVC to do data version control and code control based on NAS storage? I have looked at the Shared Development Server documentation, but it is not clear to me how to add data from the NAS.

Thanks again!

Hi @jimmy15923 !

That happens because when you run dvc add /mnt/dataset/project1_data, dvc tries to cache your data by storing it in .dvc/cache and creating links back to /mnt/dataset/project1_data. Did you mean to cache it and version it with dvc, or do you just want to specify it as a dependency for your next stage without caching it? If it is the latter, you could simply specify it as an external dependency in your dvc run command, like so: dvc run -d /mnt/dataset/project1_data ....
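
For example, a minimal sketch of such a stage (the script name train.py and the output model.pkl are hypothetical, just to illustrate the flags):

cd /home/user/project1/
dvc run -d /mnt/dataset/project1_data \
        -o model.pkl \
        python train.py
# /mnt/dataset/project1_data is tracked as an external dependency:
# DVC records its checksum to know when the stage must be re-run,
# but the directory itself is not copied into .dvc/cache.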

Thanks,
Ruslan

Hi @kupruser!

Thanks for your reply!
Since /mnt/dataset/project1_data will be updated every day, I want to version it (i.e., be able to check out old versions).
Sorry, I am not clear on what “specify it as a dependency for your next stage without caching it” means, so I don’t know which case applies to me.


Hi @jimmy15923!

Could you give a little bit more context, please:

  1. How long was it taking before you interrupted the process?
  2. How do you mount your NAS storage? NFS, something else?
  3. How is the code that processes this data written now? Does it access the data via the /mnt/dataset/project1_data/ path, or do you use symlinks into the project’s directory?

It looks like in your case it would be beneficial to keep the DVC cache (the place where all versions of all datasets, models, etc. are stored) on your NAS. That way you avoid copying large files from your NAS to the machine (let me know if you actually want to have a copy of your data in your project’s workspace).

cd /home/user/project1/
dvc init
dvc cache dir /mnt/dataset/storage
dvc config cache.type "reflink,symlink,hardlink,copy" # to enable symlinks to avoid copying
dvc config cache.protected true # to make links read-only so that you don't corrupt them accidentally
git add .dvc .gitignore
git commit . -m "initialize DVC"

Now, to add a first version of the dataset into the DVC cache (this is done once for a dataset), I would do the following:

cd /mnt/dataset
cp -r /home/user/project1 . # copy the project repo to the NAS, next to the data
cd project1
mv /mnt/dataset/project1_data/ data/
dvc add data # it should be way faster now
git add data.dvc .gitignore
git commit . -m "add first version of the dataset"
git tag -a "v1.0" -m "dataset v1.0"
git push origin HEAD
git push origin v1.0

Next, in your project, you can do something like this:

cd /home/user/project1/
git pull
# you should see data.dvc file now
dvc checkout
# you should see the data directory now; it should be a symlink to the NAS storage

Hi @shcheklein!

Really appreciate your reply, and sorry I didn’t notice it until today!

About your questions:

  1. How long was it taking before you interrupted the process?
    About 10 minutes; it looked like it would need a lot more time to finish.

  2. How do you mount your NAS storage? NFS, something else?
    It’s NFS, so everyone on the server can access the NAS.

  3. How is the code that processes this data written now? Does it access the data via the /mnt/dataset/project1_data/ path, or do you use symlinks into the project’s directory?
    I load the data directly from the NAS; code below:

import cv2
img = cv2.imread("/mnt/dataset/project1_data/imag.png")

I agree that it’s beneficial to keep the DVC cache on the NAS, but I couldn’t find an example of this in the documentation.

Thanks again for your sample code, I will try it :grinning:!


Hi @shcheklein!

Thanks for your code; it works perfectly in my case.

Just two questions

  1. I have a parse_data.py script, which parses new data and saves it into /mnt/dataset/project1_data/ daily, so I get new images in project1_data every day.
    Based on your code, you move all the images into the copied project’s data/ directory:
cd /mnt/dataset/project1
mv /mnt/dataset/project1_data/*.png  data/

Can I simply revise the code in parse_data.py and save data directly into /mnt/dataset/project1/data/?

  2. When adding the new data, dvc unprotects all the data, which takes a lot of time (about 20 minutes). Is this inevitable?
    (screenshot: dvc output unprotecting the existing data files)

Hi @jimmy15923 !

That is strange. What DVC version are you using?

Hi!
I think it’s the latest?
(screenshot of the dvc version output)


Hi, @jimmy15923!

  1. Yes, absolutely! You can change your code and write directly to the data folder (see the sketch after this list).

(the concern is that we have seen that DVC hangs when you try to run it directly from NFS, let us know if you hit this issue - we’ll figure out a workaround)

  2. Could you share a few more details, please? Your config file (.dvc/config) for the repo where this is happening, and the sequence of commands/steps to reproduce it.
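
For the daily update from your first question, a minimal sketch of the flow could look like this (assuming parse_data.py can write straight into the tracked data/ directory; adjust the paths and tag names to your setup):

cd /mnt/dataset/project1
python parse_data.py                # now saves the new images into data/
dvc add data                        # re-hash the directory and cache the new files
git add data.dvc
git commit -m "daily dataset update"
git tag -a "v1.1" -m "dataset v1.1"
git push origin HEAD
git push origin v1.1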

Hi, @shcheklein!

Thanks for your reply. I changed cache.protected to false, but it still doesn’t work. Here is my config:

root@be9f4fa8f4b2:/home/jimmy15923/dvc_project# cat .dvc/config
[cache]
dir = /mnt/dataset/dvc_exp/
type = symlink
protected = false

I followed your suggestion and saved the new data directly into

/mnt/dataset/project1/data (which is a copy of /home/user/project1).

Then I used

dvc add data
(screenshot: dvc add output unprotecting files and re-computing md5 checksums)

You can see that it is unprotecting all the data! I also have two large files, and it seems their md5 is re-computed every time (not just once). Did I set anything wrong? Many thanks for your support!

Hi @jimmy15923 !

Indeed, when re-adding your data, dvc first unprotects it so that it can work on it. We should definitely act smarter about it. Could you please create an issue for that on our GitHub?

The checksum re-computation happens for that same reason. Since you are using symlinks, the file has a different inode after unprotect, so dvc is not able to find its checksum in our state db and has to re-compute it. We could fix that by saving an entry to the state db when we unprotect something whose checksum we already know. Please be sure to link our conversation here in the GitHub issue you create.
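
If you want to see the inode change yourself, here is a quick illustration (the file name is hypothetical):

cd /mnt/dataset/project1
ls -iL data/some_image.png         # inode of the cache file the symlink points to
dvc unprotect data/some_image.png
ls -iL data/some_image.png         # now a regular copy with a new inode,
                                   # which is why the saved checksum no longer matches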

Thanks,
Ruslan