Version control of raw data shared simultaneously with colleagues

Hello, I recently came across DVC while looking for a tool for data versioning and I’ve found DVC very useful, so I’m testing several features for data versioning with my colleagues.

But we have difficulty managing the raw dataset when we work with it simultaneously, and we would like to ask for some help/guidance.

First, we describe our system as follows (assume there are two colleagues, denoted Col1 and Col2):

  • Col1 and Col2 each have a workspace (denoted WS1 and WS2), and each workspace has a mounted NAS that stores the image data to be shared.
    Path for WS1: /home/work/dvc_col1 %% Here Col1 executes git init, dvc init, etc.
    Path for image data to share among colleagues in NAS: /home/work/mnt/storage/img_data

    Path for WS2: /home/work/dvc_col2 %% Here Col2 executes git init, dvc init, etc.
    Path for image data to share among colleagues in NAS: /home/work/mnt/storage/img_data

Col1 starts data versioning by following the DVC docs.
Since the data to be versioned is located outside the workspace, Col1 executes the following in WS1:

dvc cache dir /home/work/mnt/save_cache
dvc config cache.shared group
git add .dvc
git commit -m "set cache dir"
dvc add /home/work/mnt/storage/img_data --external
git add img_data.dvc
git commit -m "1st version data, 300 images"
git tag -a "v1.0" -m "1st version"

(This 1st version includes 300 images.)

  • Then an additional 200 images are added to /home/work/mnt/storage/img_data (so 500 images in total).
    To manage the version of the dataset, Col1 executes the following:

    dvc add /home/work/mnt/storage/img_data --external
    git add img_data.dvc
    git commit -m "2nd version data, 200 images added, in total 500 images"
    git tag -a "v2.0" -m "2nd version"

  • Col1 can then switch between versions of the image data with ‘git checkout v1.0’ or ‘git checkout v2.0’, followed by ‘dvc checkout’.

  • Col1 pushes to their GitLab project with ‘git push’, and Col2 downloads it with ‘git pull’.

  • After this, Col2 can also switch between versions of the same image data with ‘git checkout’ and ‘dvc checkout’, and it works fine.

But a problem arises here:

  • When Col1 executes ‘git checkout v2.0’ and ‘dvc checkout’ in WS1, the number of images in ‘/home/work/mnt/storage/img_data’ becomes 500, and Col2 sees the same 500 images.

  • Since Col2 wants to use the 1st version of the dataset, Col2 executes ‘git checkout v1.0’ and ‘dvc checkout’ in WS2, and the number of images in ‘/home/work/mnt/storage/img_data’ becomes 300. This is a problem, because Col1 can no longer access the 200 added images: ‘/home/work/mnt/storage/img_data’ now shows only 300 images, so Col1 and Col2 can’t each access the dataset version they want SIMULTANEOUSLY.

Are we using DVC wrong? We want each of us to be able to access a different version of the dataset simultaneously with DVC. We would really appreciate your comments and help. Thank you.

Hi @schakal. You should avoid combining the “external” and “shared cache” strategies.

From the Managing External Data guide:

Warning: An external cache could be shared among copies of a DVC project. Please do not use external outputs in that scenario, as dvc checkout in any project would overwrite the working data for all projects.

What’s the use case behind wanting to use external and shared cache combined? Would some alternatives work for you?

Highlighted from the above link:

In most cases, alternatives like the to-cache or to-remote strategies of dvc add and dvc import-url are more convenient.

@daavoo Thank you for the answer.

Could you share some example code that can be used for our case?

I’m not sure whether any method other than dvc checkout can version-control the shared datasets.

I think you should be able to use the same code you shared, just with the --external option removed.
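As a rough sketch of what that could look like with the to-cache strategy mentioned above: the helper below adds the NAS directory with `dvc add ... -o`, so the tracked output lives inside the workspace as `img_data` instead of being an external output. This assumes a DVC version where `dvc add` accepts `-o`/`--out`; the function name `snapshot_dataset` is made up for illustration, and the paths are the ones from the original post.

```shell
# Shared NAS directory from the original post.
SRC=/home/work/mnt/storage/img_data

# Hypothetical helper: snapshot the NAS directory as a workspace-local output.
# $1 = commit/tag message, $2 = Git tag name
snapshot_dataset() {
  # to-cache strategy: data is stored in the DVC cache and placed at ./img_data
  dvc add "$SRC" -o img_data &&
  # the .dvc file and the updated .gitignore are what Git tracks
  git add img_data.dvc .gitignore &&
  git commit -m "$1" &&
  git tag -a "$2" -m "$1"
}
```

Col1 would run something like `snapshot_dataset "1st version data, 300 images" v1.0` in WS1. After `git pull`, Col2 runs `git checkout v1.0` and `dvc checkout` in WS2, which should only change WS2's own `img_data` directory and leave WS1 and the NAS folder untouched.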

@schakal Hello, I’m also struggling to solve the same situation. Have you solved it?

@schakal @eririri
What @daavoo is pointing out here is that we do not recommend simultaneous work on external datasets. What you could do is create a separate repository that only takes care of versioning the data (a so-called data registry: https://dvc.org/doc/use-cases/data-registries). You can then use the dvc import command to obtain a particular version of the dataset in your own project, so that both colleagues can work on the dataset version each of them needs.
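A minimal sketch of the import step, assuming the data registry is a Git repository that tracks `img_data` with DVC, tags its versions as `v1.0` and `v2.0`, and has its data pushed to a DVC remote. The registry URL and the function name `import_dataset` are placeholders, not real values from this thread.

```shell
# Hypothetical Git URL of the data-registry repository (an assumption).
REGISTRY_URL="git@gitlab.example.com:team/data-registry.git"

# Hypothetical helper: import one tagged version of img_data into the
# current DVC project.
# $1 = Git tag in the registry (e.g. v1.0), $2 = local output directory
import_dataset() {
  dvc import "$REGISTRY_URL" img_data --rev "$1" -o "$2"
}
```

For example, Col2 could run `import_dataset v1.0 img_data` in WS2 while Col1 runs `import_dataset v2.0 img_data` in WS1. Each project then records in its own `.dvc` file which registry revision it uses, so the two versions can be used side by side.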