Merging of files in DVC-tracked directories

gebbissimo · December 22, 2020, 1:32pm

I just tested DVC with some dummy data and have run into a situation which would be less than ideal for production. Is there an elegant way out?

Situation

The workspace has only one DVC-tracked directory named “data/”
There are two git branches. In both of them, we added/removed separate files to “data/”
Now, we want to merge both branches

Expected outcome

Unless there are conflicts, a simple git merge yields the union of the file operations in both branches

Actual outcome

A git conflict in data.dvc. I can’t really merge; I can only pick the data version from one of the branches.

Given that the command “dvc diff” shows some very useful output, is there a way to merge both data versions semi-automatically? I have read the page https://dvc.org/doc/user-guide/how-to/merge-conflicts, but it only covers the “append-only” strategy and does not mention dvc diff at all :(.

P.s.: As a side question: Can “dvc diff” detect and highlight renamed files (since they have the same hash value)?

kupruser · December 22, 2020, 6:56pm

Hi @gebbissimo !

As I understand it, those two datasets have modified and deleted data, not only new unique additions?

As noted in https://dvc.org/doc/user-guide/how-to/merge-conflicts, we have a merge-driver that can automatically merge append-only datasets, but it currently lacks functionality for more involved conflict resolution. We do plan on adding support for it in the future, so it would be great to figure out the scenario you have and the functionality you expect from it.
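To make “append-only” concrete, here is a minimal Python sketch of what such a union merge could look like. It is not DVC’s actual merge-driver code: the function name `merge_append_only` and the representation of a directory as a dict of path to content hash are assumptions made for illustration.

```python
def merge_append_only(ancestor, ours, theirs):
    """Union two directory listings, but only if both sides purely added files.

    Each argument is a hypothetical {relative_path: content_hash} mapping.
    """
    # Every file of the common ancestor must survive unchanged on both sides.
    for name, side in (("ours", ours), ("theirs", theirs)):
        for path, digest in ancestor.items():
            if side.get(path) != digest:
                raise ValueError(
                    f"{name} deleted or modified {path!r}; not append-only"
                )

    merged = dict(ancestor)
    for side in (ours, theirs):
        for path, digest in side.items():
            # The same new path with different content cannot be unioned.
            if merged.setdefault(path, digest) != digest:
                raise ValueError(
                    f"both sides added {path!r} with different content"
                )
    return merged
```

As soon as one branch deletes or modifies a file, the sketch refuses to merge, which matches the limitation described above.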

gebbissimo · December 22, 2020, 8:47pm

//EDIT: Ah, I believe the issue is that it’s impossible to distinguish the following two situations without keeping track of the file operations history, right?

Situation 1
common_branch: A-AB
branch1: ABC (add C)
branch2: A (delete B)
expected merge: AC

Situation 2
common_branch: A
branch1: AB-ABC (add B,C)
branch2: AD-A (add D, then delete D)
expected merge: ABC

Thus, in such a case, one would need to decide for each file in DVC DIFF separately, whether to add or delete it. Hmm… I wonder how git LFS deals with such scenarios?

//EDIT2: Would it be possible to detect the difference of each branch w.r.t. the common git ancestor node and then take the union of those two differences? At least in the example above it should work, right?
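The idea in EDIT2 can be sketched in a few lines of Python. This is an illustration only, not something DVC provides: `three_way_merge` is a hypothetical helper, and each directory version is assumed to be a dict of file name to content hash. Both situations above end with the same branch tips (ABC and A); only the common ancestor differs, and that is enough to tell them apart.

```python
def three_way_merge(base, ours, theirs):
    """Apply each branch's changes relative to the common ancestor.

    Returns (merged, conflicts). A path that both branches changed in
    different ways is listed in conflicts and left out of merged.
    """
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:        # both sides agree (this includes "both deleted")
            result = o
        elif o == b:      # only theirs touched this path
            result = t
        elif t == b:      # only ours touched this path
            result = o
        else:             # both sides changed it differently
            conflicts.append(path)
            continue
        if result is not None:
            merged[path] = result
    return merged, conflicts


# Situation 1: ancestor AB, branch1 adds C, branch2 deletes B.
s1, _ = three_way_merge(
    {"A": "hA", "B": "hB"},
    {"A": "hA", "B": "hB", "C": "hC"},
    {"A": "hA"},
)
print(sorted(s1))  # ['A', 'C']

# Situation 2: ancestor A, branch1 adds B and C, branch2 ends up unchanged.
s2, _ = three_way_merge(
    {"A": "hA"},
    {"A": "hA", "B": "hB", "C": "hC"},
    {"A": "hA"},
)
print(sorted(s2))  # ['A', 'B', 'C']
```

Only the merge base is needed, not the full history of file operations. Files modified differently on both branches would still need a manual decision.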

P.s.: The other thing is that file renaming currently shows up in DVC DIFF <another_git_branch> as:

Added: (new filename after renaming)
Deleted: (old filename)

Would it be possible to group this information into a new category Renamed? Would make the overview more helpful in my view.
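A rough Python sketch of the grouping I have in mind, assuming the Added and Deleted entries are available as dicts of path to content hash. The function `group_renames` and the example paths are made up for illustration and are not part of DVC.

```python
from collections import defaultdict


def group_renames(added, deleted):
    """Pair added and deleted paths that share the same content hash.

    added and deleted are hypothetical {path: content_hash} mappings.
    Returns (renamed, added_rest, deleted_rest), where renamed is a list
    of (old_path, new_path) tuples.
    """
    deleted_by_hash = defaultdict(list)
    for path in sorted(deleted):
        deleted_by_hash[deleted[path]].append(path)

    renamed, added_rest = [], {}
    for path in sorted(added):
        candidates = deleted_by_hash.get(added[path])
        if candidates:
            # Same content disappeared elsewhere: treat it as a rename.
            renamed.append((candidates.pop(0), path))
        else:
            added_rest[path] = added[path]

    # Whatever was not matched above is a genuine deletion.
    deleted_rest = {
        path: deleted[path]
        for paths in deleted_by_hash.values()
        for path in paths
    }
    return renamed, added_rest, deleted_rest


renamed, added_rest, deleted_rest = group_renames(
    added={"data/new.png": "h1", "data/extra.png": "h2"},
    deleted={"data/old.png": "h1", "data/gone.png": "h3"},
)
print(renamed)  # [('data/old.png', 'data/new.png')]
```

When several files share the same content, the pairing is arbitrary (here: alphabetical), so it would only be a display hint.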

Thanks again!

kupruser · December 22, 2020, 9:33pm

Got it. So clearly those kinds of scenarios would be hard to do automatically. And we need to provide a convenient way to resolve conflicts in your datasets by hand. We will be introducing it in the future.

“I wonder how git LFS deals with such scenarios?”

It doesn’t store datasets as a whole but tracks each file individually, so conflicts with added/deleted files are resolved the same way regular git does. With modified files, though, it will drop you into an editor showing the old and new hash of that file. We hope to be able to provide something more handy, especially considering that in our scenarios datasets might consist of many millions of files.

Yep, there will be some common strategies to choose from for our merge-driver. Right now it only supports unions of append-only dirs, which might get adjusted to be smarter in the scenarios you’ve described. The current implementation of the merge-driver is more of a POC that we created to cover a basic scenario, and we were waiting for users like you to request additional features.

Great idea! Mind creating a feature request in the iterative/dvc issue tracker on GitHub, please?

gebbissimo · December 22, 2020, 10:00pm

Great answer, thanks. Opened two new feature requests. Can probably close this thread now.

for the “Renamed” group: https://github.com/iterative/dvc/issues/5150

for the merging strategy based on the union of the differences: https://github.com/iterative/dvc/issues/5151
