We are trying to use DVC very much along the lines of the described shared-development-server use case. This is great for getting started, but the devil is in the details.
The setup is that we use shared storage in an HPC environment, where we try to reduce data duplication.
We use a shared cache, and because we might later want to connect this to remote storage, we have created a local “remote” storage for the moment.
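Roughly, the configuration looks like this (the paths are the ones from our setup; the remote name localstore is just a placeholder):

# shared cache on the HPC filesystem, group-writable
dvc cache dir /data/mm/shared/dvc
dvc config cache.shared group
# local "remote" that may later be replaced by real remote storage
dvc remote add -d localstore /data/mm/shared/dvc-storage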
In this setup we have two different types of data: smaller data files, which are fine to download into the individual development folders, and a folder containing millions of large files that we don’t want to duplicate.
We have two issues now.
Firstly, the local “remote” causes error messages because of write conflicts. We used the group setting “dvc config cache.shared group”, which seems to work for the cache, but when one of the remote’s hash-prefix folders (00 to ff) is reused, we get write conflicts during “dvc push”.
Secondly, we are not sure how to handle the folder with the millions of large files. We should probably create a link to the folder from each individual directory. The data in this folder is generated elsewhere, and if we ever have to re-run the data generation, we want to be informed whether this leads to different files. Hence, we want to hash the folder or its contents, but we are not sure whether it is viable to do this for this number of files individually. What would be the best solution here?
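To make the intent concrete, what we have in mind is roughly the following (big_data_folder is a placeholder for the generated directory):

# hash the directory contents once and track the folder as a single output
dvc add big_data_folder
# ... later, after re-running the data generation ...
# should report big_data_folder as changed if any file differs
dvc status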
Could you provide more info on what such a conflict looks like? Do you, by any chance, have a log showing those errors?
As to your second question: I presume each dev/scientist has her/his own folder on this HPC environment, right? I believe it would be possible to work even with the “big” dataset if you set the cache type to reflink, symlink, or hardlink. Take a look here: Large Dataset Optimization.
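For example, something along these lines should avoid copying the large files into every development folder (a sketch; adjust the cache type priority to whatever your filesystem supports):

# prefer links over copies so checkouts do not duplicate data
dvc config cache.type "reflink,hardlink,symlink"
# re-link data that is already checked out
dvc checkout --relink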
Is your data versioning handled by a git/DVC repository? If so, you can always use dvc import and dvc get to obtain the smaller datasets, if *link optimizations are not what you are looking for.
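For instance (the repository URL and dataset path here are hypothetical):

# one-time download, no link to the source repo is kept
dvc get https://github.com/example/dataset-registry data/smaller
# tracked download: records the source repo and revision in a .dvc file
dvc import https://github.com/example/dataset-registry data/smaller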
git checkout -b dev/obsolete
echo "test" > data/test.txt
dvc add data/test.txt
git add data/.gitignore data/test.txt.dvc
git commit -m "X"
dvc push
ERROR: failed to upload '…/…/…/…/data/mm/shared/dvc/d8/e8fca2dc0f896fd7cb4cb0031ba249' to '…/…/…/…/data/mm/shared/dvc-storage/d8/e8fca2dc0f896fd7cb4cb0031ba249' - [Errno 13] Permission denied: '/data/mm/shared/dvc-storage/d8/e8fca2dc0f896fd7cb4cb0031ba249.hzdYodwh7UgSBGuVAD4iSE.tmp'
ERROR: failed to push data to the cloud - 1 files failed to upload
The difference between dvc (the cache) and dvc-storage (the local remote) is that in the cache, files and folders get created with group-level permissions, but that is not the case for the upload. So when there is a new file, we can add it to the cache, but if the hash-prefix folder already exists in the “local remote” with user-specific permissions, the upload is blocked. Which prefix folder is hit depends on the hash, in this case d8. Maybe we could force all of these folders to have the right permissions, provided no further folders are added after this point?
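As a stopgap, we are thinking of something like this on the local “remote” (assuming a shared group, here called datagrp, and that the setgid bit is acceptable on this filesystem):

# give the shared group ownership and write access on the existing prefix folders
chgrp -R datagrp /data/mm/shared/dvc-storage
chmod -R g+rwX /data/mm/shared/dvc-storage
# let newly created files and subfolders inherit the group
find /data/mm/shared/dvc-storage -type d -exec chmod g+s {} +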
I will have a look at the link you provided to see whether that is already what we are looking for.