I just tested DVC with some dummy data and have run into a situation which would be less than ideal for production. Is there an elegant way out?
Situation
- The workspace has only one DVC-tracked directory named “data/”
- There are two git branches. In each of them, we added/removed different files in “data/”
- Now, we want to merge both branches
Expected outcome
- Unless there are conflicts, a simple git merge yields the union of the file operations in both branches
Actual outcome
- A git conflict in data.dvc. I can’t really merge; I can only pick the data version from one of the branches
Given that the command “dvc diff” shows some very useful output, is there a way to merge both data versions semi-automatically? I have read the page https://dvc.org/doc/user-guide/how-to/merge-conflicts, but it only covers the “append-only” strategy and doesn’t even mention dvc diff :(.
P.s.: As a side question: Can “dvc diff” detect and highlight renamed files (since they have the same hash value)?
Hi @gebbissimo !
As I understand it, those two datasets have modified and deleted data, not only new unique additions?
As noted in https://dvc.org/doc/user-guide/how-to/merge-conflicts, we have a merge driver that can automatically merge append-only datasets, but it currently lacks functionality for more involved conflict resolution. We do plan on adding support for it in the future, so it would be great to figure out the scenario you have and the functionality you expect from it.
//EDIT: Ah, I believe the issue is that it’s impossible to distinguish the following two situations without keeping track of the history of file operations, right?
Situation 1
common_branch: A-AB
branch1: ABC (add C)
branch2: A (delete B)
expected merge: AC
Situation 2
common_branch: A
branch1: AB-ABC (add B,C)
branch2: AD-A (add D, then delete D)
expected merge: ABC
Thus, in such a case, one would need to decide separately for each file listed by dvc diff whether to add or delete it. Hmm… I wonder how git LFS deals with such scenarios?
//EDIT2: Would it be possible to compute the difference of each branch w.r.t. the common git ancestor and then take the union of those two differences? At least in the examples above it should work, right?
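To make the EDIT2 idea concrete, here is a minimal sketch of such a three-way merge. It assumes each version of “data/” can be represented as a plain mapping from file path to content hash; the function name three_way_merge and the hash strings are hypothetical and not part of DVC.

```python
def three_way_merge(base, ours, theirs):
    """Merge two directory listings (path -> hash) against their common ancestor.

    Returns (merged, conflicts): the merged listing and the paths that
    both sides changed in different ways.
    """
    merged = {}
    conflicts = []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        # None means "file does not exist in this version".
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o      # both sides agree (same change or no change)
        elif o == b:
            result = t      # only "theirs" touched this path
        elif t == b:
            result = o      # only "ours" touched this path
        else:
            conflicts.append(path)  # both changed it differently
            continue
        if result is not None:
            merged[path] = result
    return merged, conflicts


# Situation 1: ancestor AB, branch1 adds C, branch2 deletes B.
base = {"A": "h_a", "B": "h_b"}
branch1 = {"A": "h_a", "B": "h_b", "C": "h_c"}
branch2 = {"A": "h_a"}
merged, conflicts = three_way_merge(base, branch1, branch2)
print(sorted(merged), conflicts)
```

Because only the ancestor and the two branch tips are compared, the intermediate add and delete of D in Situation 2 never shows up, so no operation history is needed. Files modified differently on both sides would still need a manual decision.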
P.s.: The other thing is that a renamed file currently shows up in dvc diff <another_git_branch> as:
Added: (new filename after renaming)
Deleted: (old filename)
Would it be possible to group this information into a new category, Renamed? That would make the overview more helpful in my view.
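As a rough illustration of what I mean, here is a small sketch that pairs added and deleted entries with identical hashes. It assumes the diff is available as two mappings from path to hash; the function name group_renames and the sample paths are made up for this example and are not DVC's actual output format.

```python
def group_renames(added, deleted):
    """Pair added and deleted paths that share a hash into a 'renamed' list.

    added / deleted: dicts mapping path -> content hash.
    Returns (renamed, still_added, still_deleted), where renamed is a list
    of (old_path, new_path) tuples.
    """
    # Index deleted paths by hash so matching an added file is a lookup.
    deleted_by_hash = {}
    for path in sorted(deleted):
        deleted_by_hash.setdefault(deleted[path], []).append(path)

    renamed = []
    still_added = {}
    for path in sorted(added):
        candidates = deleted_by_hash.get(added[path])
        if candidates:
            # Same content disappeared elsewhere: treat it as a rename.
            renamed.append((candidates.pop(0), path))
        else:
            still_added[path] = added[path]

    # Whatever was not matched remains a plain deletion.
    still_deleted = {
        p: deleted[p] for paths in deleted_by_hash.values() for p in paths
    }
    return renamed, still_added, still_deleted


added = {"data/new.txt": "h1", "data/extra.txt": "h2"}
deleted = {"data/old.txt": "h1", "data/gone.txt": "h3"}
print(group_renames(added, deleted))
```

If several files share the same content, the pairing is ambiguous; this sketch simply matches them in sorted order.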
Thanks again!
Got it. So clearly those kinds of scenarios would be hard to do automatically. And we need to provide a convenient way to resolve conflicts in your datasets by hand. We will be introducing it in the future.
I wonder how git LFS deals with such scenarios?
It doesn’t store datasets as a whole, but rather tracks each individual file, so conflicts with added/deleted files are resolved the same way regular git does. With modified files, though, it will drop you into an editor with the old and new hash of that file. We hope to be able to provide something handier, especially considering that in our scenarios datasets might consist of many millions of files.
Yep, there will be some common strategies to choose from for our merge driver. Right now it only supports unions of append-only dirs, which might get adjusted to be smarter in the scenarios you’ve described. The current implementation of the merge driver is more of a POC that we created to cover a basic scenario, and we were waiting for users like you to request additional features.
Great idea! Mind creating a feature request in the iterative/dvc issues on GitHub, please?
Great answer, thanks. Opened two new feature requests. Can probably close this thread now.
for the “Renamed” group: https://github.com/iterative/dvc/issues/5150
for the merging strategy based on the union of the differences: https://github.com/iterative/dvc/issues/5151