Dear DVC community,
I am about to set up a data repository based on Git + DVC. I am new to DVC, so I may be overlooking something.
Let us assume the following repository layout after cloning the Git repo and pulling everything via DVC:
repo1
|- data_file1
|- data_file1.dvc
|- data_file2
|- data_file2.dvc
Meanwhile, another user has cloned and DVC-pulled the repository in the same way into a directory repo2 (on a different machine, so there are no shared-cache issues here). That user has modified data_file2 and then done everything necessary to push the change to the Git repo and the DVC remote storage (in autostage mode, this is: dvc add data_file2, git commit, git push, dvc push). In the end, that user has
repo2
|- data_file1
|- data_file1.dvc
|- data_file2 <- different to repo1
|- data_file2.dvc <- different to repo1
which will also be the latest version on the server.
Now, out of habit, the first user runs “git pull” in the repo1 directory and ends up with
repo1
|- data_file1
|- data_file1.dvc
|- data_file2 <- as originally present in repo1
|- data_file2.dvc <- as uploaded by the second user, i.e. identical to repo2
If this first user issues
dvc status data_file2.dvc
they get the following output (with my DVC version, 2.11.0):
data_file2.dvc:
changed outs:
not in cache: data_file2
This is understandable (though still somewhat misleading), since the version of data_file2 uploaded from repo2 is indeed not in the local cache. If the first user then issues
dvc status --remote myremote
where myremote is the remote used by both users, the result is
deleted: data_file2
This, however, is confusing: on the remote, the file was changed, not deleted.
Am I missing something here?
A somewhat related question in the same setting: what is the suggested way to get the latest version of data_file2? Would it just be a general “dvc pull”?
But what if one has a huge repo and has only done partial pulls of individual .dvc files? Would one then have to go through all these files manually? It feels like a “dvc update” that only updates already-pulled contents would be useful here. But my understanding is that “dvc update” only does this for files that have been imported from another repo…
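To make the idea concrete, here is a rough Python sketch of the behaviour I have in mind. It is only a workaround under my own assumptions, not an existing DVC feature: it assumes every name.dvc file sits next to the output it tracks (as “dvc add name” creates it by default), and it treats an output that already exists in the workspace as "previously pulled". The function names pulled_dvc_files and pull_already_present are made up for this example; the only DVC command used is “dvc pull” with explicit targets.

```python
import subprocess
from pathlib import Path


def pulled_dvc_files(repo_root):
    """Return the .dvc files whose tracked output already exists in the workspace.

    Assumption: `name.dvc` lives next to the output `name` it tracks.
    """
    root = Path(repo_root)
    found = []
    for dvc_file in sorted(root.rglob("*.dvc")):
        # Ignore the internal .dvc/ directory itself and anything inside it.
        if not dvc_file.is_file():
            continue
        if ".dvc" in dvc_file.relative_to(root).parts[:-1]:
            continue
        # data_file2.dvc -> data_file2; present means it was pulled before.
        if dvc_file.with_suffix("").exists():
            found.append(dvc_file)
    return found


def pull_already_present(repo_root):
    """Run `dvc pull` only for outputs that are already in the workspace."""
    root = Path(repo_root)
    targets = [str(p.relative_to(root)) for p in pulled_dvc_files(root)]
    if targets:
        # Requires the dvc CLI on PATH and a configured default remote.
        subprocess.run(["dvc", "pull", *targets], cwd=root, check=True)
    return targets
```

In the example above, calling pull_already_present("repo1") after “git pull” would pass both data_file1.dvc and data_file2.dvc to “dvc pull”, while outputs that were never pulled would be left alone.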
Thank you so much for your patience and your help!