"dvc status" confusion in concurrent use case

Dear DVC community,

I am about to set up a data repository based on Git + DVC. I am new to DVC, so I may still be overlooking things.

Let us assume we have the following repository after cloning the Git repo and pulling everything via DVC.

repo1
|- data_file1
|- data_file1.dvc
|- data_file2
|- data_file2.dvc

Meanwhile, another user has cloned and dvc-pulled the repository in the same way into a directory repo2 (on a different machine, so there are no shared-cache issues here). That user has modified data_file2 and then done everything necessary to push the change to the Git repo and the DVC remote storage (in autostage mode, this is: dvc add data_file2, git commit, git push, dvc push). At the end, that user will have

repo2
|- data_file1
|- data_file1.dvc
|- data_file2         <- different to repo1
|- data_file2.dvc     <- different to repo1

which will also be the latest version on the server.
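
For reference, the second user's publishing steps listed above could be wrapped roughly as follows. This is only a sketch: the function name publish_change and the commit message are made up for illustration, and it assumes autostage is enabled so that dvc add already stages data_file2.dvc for Git.

```shell
# Sketch of the publish sequence described above.
# Assumptions: autostage is on (so `dvc add` stages the .dvc file for Git),
# and a default DVC remote is configured. `publish_change` is a hypothetical name.
publish_change() {
    # $1: data file to publish, $2: commit message
    dvc add "$1" &&          # hash the file, update <file>.dvc and the local cache
    git commit -m "$2" &&    # record the updated .dvc file in Git
    git push &&              # share the new .dvc file
    dvc push                 # upload the new file content to the DVC remote
}

# Usage (inside repo2):
#   publish_change data_file2 "Update data_file2"
```

The `&&` chain stops at the first failing step, so nothing is pushed to Git if dvc add fails.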

Now, as a common habit, the first user, working in the repo1 directory, will do a “git pull” and will have

repo1
|- data_file1
|- data_file1.dvc
|- data_file2      <- as originally present in repo1
|- data_file2.dvc  <- as uploaded by the second user, i.e identical to repo2

If this first user issues

dvc status data_file2.dvc

they will get, with my DVC version 2.11.0, this output

data_file2.dvc:
	changed outs:
		not in cache:       data_file2

which is understandable (but still somewhat misleading), as indeed data_file2, as uploaded from repo2, is not in the local cache. If the first user then issues

dvc status --remote myremote

where myremote is the remote used by both users, the result is

deleted:            data_file2

This is however now confusing, as on the remote, the file was changed, but not deleted.

Am I missing something here?

Somewhat related in the same setting: What is the suggested way to get the latest version then of data_file2? Would it just be a general “dvc pull”?

But what if one has a huge repo and only did partial pulls on individual .dvc files. Would one then have to manually go through all these files? It feels like a “dvc update” that only updates already pulled contents would be good, here. But my take is that “dvc update” only does this for files that have been imported from another repo…

Thank you so much for your patience and your help!

@zaspel
Hi! Thank you for the work you put into describing this use case.

This is however now confusing, as on the remote, the file was changed, but not deleted.

Agreed, but you need to account for what status does:

$ dvc status --help
Show changed stages, compare local cache and a remote storage.
...
  -r <name>, --remote <name>
                        Remote storage to compare local cache to

So the status shows the result of comparing to the local cache. As DVC versions the data on file level (we are saving whole files, so if you modify a file, it will be saved in the cache as a new file), from the point of view of DVC, your local cache (in repo1) does not have the “new” version of data_file2, hence “deleted”.
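
To make the comparison concrete, here is a toy model of how the cloud status labels fall out of it. It is not DVC's implementation: classify_hash and the plain-text hash lists are hypothetical stand-ins, and the "new" and "missing" labels follow my reading of the dvc status documentation rather than anything stated in this thread.

```shell
# Toy model of the cache-vs-remote comparison behind `dvc status --remote`.
# Hypothetical helper: each list file holds one content hash per line.
classify_hash() {
    # $1: hash referenced by a .dvc file in the workspace
    # $2: file listing hashes in the local cache
    # $3: file listing hashes in the remote
    in_cache=no
    in_remote=no
    grep -qxF -e "$1" "$2" && in_cache=yes
    grep -qxF -e "$1" "$3" && in_remote=yes
    case "$in_cache/$in_remote" in
        yes/yes) echo "ok" ;;        # present on both sides, nothing to report
        yes/no)  echo "new" ;;       # only in the local cache
        no/yes)  echo "deleted" ;;   # only in the remote
        no/no)   echo "missing" ;;   # in neither place
    esac
}
```

In the scenario above, the hash now recorded in data_file2.dvc is absent from repo1's cache but present in the remote, so this model prints deleted for it.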

Somewhat related in the same setting: What is the suggested way to get the latest version then of data_file2? Would it just be a general “dvc pull”?

The safest option would be to pull the files that you intend to work on, if you don’t want to download the whole repo content. You are right about dvc update.
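
For a repo with many partially pulled files, one possible way to avoid going through them by hand is sketched below. It is not a built-in DVC feature: list_pulled_targets is a hypothetical helper, and it assumes each <name>.dvc file sits next to the single output <name> it tracks, which is the layout dvc add produces by default.

```shell
# Sketch: print the .dvc files whose tracked output already exists in the
# workspace, i.e. the ones that were pulled at some point.
# Assumption: <name>.dvc tracks a single output called <name> in the same directory.
list_pulled_targets() {
    # $1: directory to scan (defaults to the current directory)
    find "${1:-.}" -name '*.dvc' -type f ! -path '*/.dvc/*' | sort |
    while IFS= read -r dvc_file; do
        # Strip the ".dvc" suffix to get the path of the tracked output.
        if [ -e "${dvc_file%.dvc}" ]; then
            printf '%s\n' "$dvc_file"
        fi
    done
}

# Usage (needs dvc and a configured remote), after `git pull`:
#   list_pulled_targets . | while IFS= read -r target; do dvc pull "$target"; done
# For the single file in this thread, that boils down to:
#   dvc pull data_file2.dvc
```

Files that were never pulled have no output in the workspace, so they are skipped and nothing extra is downloaded for them.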

Also, later down the road you will probably want to resolve conflicts. Here are some suggestions: How to Resolve Merge Conflicts in DVC

Dear @Paffciu ,

thank you so much for the patience in your answer.

So the status shows the result of comparing to the local cache. As DVC versions the data on file level (we are saving whole files, so if you modify a file, it will be saved in the cache as a new file), from the point of view of DVC, your local cache (in repo1) does not have the “new” version of data_file2, hence “deleted”.

RTFM:

deleted means that the file/directory doesn’t exist in the cache, but exists in remote storage.

Still, for a newbie, this wording feels strange, because it implies that a deletion happens when going from the remote to the local cache. However, it is of course just a question of definition whether local vs. remote means

  1. local cache → remote
  2. remote → local cache

However now, I clearly get it! Thanks again!

Thanks also for the pointer to the page on resolving merge conflicts! I did go over the documentation quite a few times but overlooked that particular page.

Thanks!

Sure thing!
I agree that the wording can get a bit confusing, but that’s why we are here: to clear up any questions :)
If you have more problems feel free to reach out to us.