Hello, I recently came across DVC while looking for a data versioning tool and have found it very useful, so my colleagues and I are testing several of its data versioning features.
However, we are having difficulty managing a raw dataset when several of us work with it at the same time, and we would like to ask for some help or guidance.
First, our setup is as follows (assume there are two colleagues, denoted Col1 and Col2):
-
Col1 and Col2 each have their own workspace (denoted WS1 and WS2), and each workspace has a mounted NAS that stores the image data to be shared.
Path for WS1: /home/work/dvc_col1 %% Here Col1 executes git init, dvc init, etc.
Path for image data to share among colleagues in NAS: /home/work/mnt/storage/img_data
Path for WS2: /home/work/dvc_col2 %% Here Col2 executes git init, dvc init, etc.
Path for image data to share among colleagues in NAS: /home/work/mnt/storage/img_data
Col1 first sets up data versioning, following the DVC docs:
Since the data to be versioned is located outside the workspace, Col1 executes the following in WS1:
dvc cache dir /home/work/mnt/save_cache
dvc config cache.shared group
git add .dvc
git commit -m "set cache dir"
dvc add /home/work/mnt/storage/img_data --external
git add img_data.dvc
git commit -m "1st version data, 300 images"
git tag -a "v1.0" -m "1st version"
(This 1st version includes 300 images.)
-
Then an additional 200 images are added to /home/work/mnt/storage/img_data (500 images in total).
To manage the new version of the dataset, Col1 executes the following:
dvc add /home/work/mnt/storage/img_data --external
git add img_data.dvc
git commit -m "2nd version data, 200 images added, in total 500 images"
git tag -a "v2.0" -m "2nd version"
-
Col1 can then switch between versions of the image data with ‘git checkout v1.0’ (or ‘git checkout v2.0’) followed by ‘dvc checkout’.
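To confirm which version is present after a checkout, we count the files in the shared directory. A minimal sketch of that check follows; `count_images` is a hypothetical helper of ours, not a DVC command, and the git/dvc lines in the comments only repeat the commands described above.

```shell
# Hypothetical helper (not part of DVC): count regular files under a directory.
count_images() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Intended use in WS1 after switching versions:
#   git checkout v1.0 && dvc checkout
#   count_images /home/work/mnt/storage/img_data   # we expect 300 for v1.0
#   git checkout v2.0 && dvc checkout
#   count_images /home/work/mnt/storage/img_data   # we expect 500 for v2.0
```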
-
Col1 then runs ‘git push’ to their GitLab project, and Col2 downloads it using ‘git pull’.
-
After this, Col2 can also switch between versions of the same image data with ‘git checkout’ and ‘dvc checkout’, and it works fine.
But a problem arises here:
-
When Col1 executes ‘git checkout v2.0’ and ‘dvc checkout’ in WS1, the number of images in ‘/home/work/mnt/storage/img_data’ becomes 500, and Col2 sees the same 500 images.
-
Since Col2 wants to use the 1st version of the dataset, Col2 executes ‘git checkout v1.0’ and ‘dvc checkout’ in WS2, and the number of images in ‘/home/work/mnt/storage/img_data’ becomes 300. This causes a problem, because Col1 can no longer access the 200 added images: ‘/home/work/mnt/storage/img_data’ now contains only 300 images, so Col1 and Col2 cannot each access the dataset version they want SIMULTANEOUSLY.
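To make the conflict concrete, the sketch below reproduces it without DVC at all. It is only a simulation under our own assumptions: plain temporary directories stand in for the cache and the NAS path, `checkout_version` is a hypothetical stand-in for "make the shared directory match the requested version", and the image counts are scaled down from 300/500 to 3/5.

```shell
# Simulation only: no git or dvc involved, all names here are hypothetical.
demo=$(mktemp -d)
shared="$demo/img_data"     # stands in for /home/work/mnt/storage/img_data
mkdir -p "$demo/versions/v1.0" "$demo/versions/v2.0"

# v1.0 holds 3 images, v2.0 holds 5 (scaled down from 300 and 500)
for i in 1 2 3; do : > "$demo/versions/v1.0/img_$i.png"; done
for i in 1 2 3 4 5; do : > "$demo/versions/v2.0/img_$i.png"; done

# Rewrite the single shared directory so it matches the requested version
checkout_version() {
    rm -rf "$shared"
    cp -R "$demo/versions/$1" "$shared"
}

checkout_version v2.0       # Col1 wants the 2nd version
col1_before=$(find "$shared" -type f | wc -l | tr -d ' ')

checkout_version v1.0       # Col2 then wants the 1st version
col1_after=$(find "$shared" -type f | wc -l | tr -d ' ')   # Col1 reads the same path

echo "Col1 before Col2's checkout: $col1_before"   # 5
echo "Col1 after Col2's checkout:  $col1_after"    # 3
```

Because both workspaces point at one physical directory, whoever checks out last decides what everyone sees, which is exactly the behavior we observe.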
Are we using DVC incorrectly? We want each of us to be able to access a different version of the dataset simultaneously with DVC. We would very much appreciate your comments and help. Thank you.