Hi @arimale !
As a more concrete scenario, I was wondering what happens if I share only the basic input data files, but no intermediate files. A collaborator will then re-run the pipeline and, by doing so, will end up with different md5 hashes in his local dvc files, probably differing from those checked in with git. I guess this is not a problem, but I wasn’t sure.
Yes, it is not a problem for dvc at all. The differing md5s will only make dvc report that the pipeline has changed, since some dependencies/outputs have different md5s. Your pipeline will still be in working order; it will just produce slightly different results because of its non-deterministic nature.
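To illustrate why a re-run shows up as "changed", here is a minimal sketch in plain Python. It is not dvc's own code: `run_stage` is a hypothetical stand-in for a non-deterministic stage, and the seeds only simulate two different runs.

```python
import hashlib
import random


def md5_of(data: bytes) -> str:
    """Return the hex md5 digest of some output bytes."""
    return hashlib.md5(data).hexdigest()


def run_stage(seed: int) -> bytes:
    """Hypothetical non-deterministic stage; the seed stands in for run-to-run noise."""
    rng = random.Random(seed)
    return f"score={rng.random():.6f}\n".encode()


# Checksum recorded by the original author (think: the one checked in with git).
recorded = md5_of(run_stage(seed=1))
# Checksum a collaborator gets after re-running the same stage locally.
local = md5_of(run_stage(seed=2))

# A mismatch only means "the output differs from what was recorded".
status = "changed" if local != recorded else "up to date"
print(status)
```

Both outputs are perfectly valid; the mismatch just tells you that the local result is not the one that was recorded.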
As I haven’t tried remote dvc repositories yet, I was also wondering if different md5 hashes somehow might lead to conflicts when pushing or pulling.
Not really. If your colleague runs `dvc repro` and gets different results, he will be able to safely push his data, even to the same repository that you’ve used. The same goes for pulling: dvc will try to pull the data that matches the current md5s in the dvc files (which are modified when you run `dvc repro`), so it will not result in any conflicts.
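The reason there is nothing to conflict over can be sketched with a toy content-addressed store, where every object is stored under the md5 of its content. This is only an illustration of the idea, not dvc's implementation; `ContentStore` and the byte strings are made up for the example.

```python
import hashlib


class ContentStore:
    """Toy remote: each object is stored under the md5 of its content."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def push(self, data: bytes) -> str:
        key = hashlib.md5(data).hexdigest()
        # Different content gives a different key, so nothing is overwritten.
        self.objects.setdefault(key, data)
        return key

    def pull(self, key: str) -> bytes:
        # Fetch exactly the object whose checksum the caller has recorded.
        return self.objects[key]


remote = ContentStore()
your_key = remote.push(b"output of your run")
colleague_key = remote.push(b"output of your colleague's run")

# Both versions live side by side in the same remote.
print(len(remote.objects))
print(remote.pull(your_key) != remote.pull(colleague_key))
```

Each person pulls by the checksum in his own dvc files, so the two sets of results never step on each other.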
I think I got the metrics part: the metrics themselves are not stored in the dvc file, but only the file name and xpath, thus every collaborator might see different metrics in the end (if pipeline is non-deterministic). Correct? I was first thinking that metrics (the actual values) are stored in dvc file and thus checked-in with git…
Correct. Dvc only remembers the path, xpath and md5 checksum of the metrics file in the dvc file itself. The current recommendation is to store the metrics file itself in git (i.e. the `-M` option for `dvc run` doesn’t add the metrics file to the dvc cache, leaving it for the user to commit to git) so that it is available even if you don’t have any of the project’s data (e.g. to view on github).
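As a rough sketch of the difference between "what is remembered about the metrics file" and "the metric value itself", consider the following. The record layout, the function names, the `auc` key and its value are all hypothetical and only for illustration; this is not the real dvc file format, and the xpath is simplified to a single JSON key.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def describe_metric(path: Path, xpath: str) -> dict:
    """Hypothetical record: where the metric lives plus a checksum, not its value."""
    return {
        "path": path.name,
        "xpath": xpath,
        "md5": hashlib.md5(path.read_bytes()).hexdigest(),
    }


def read_metric(path: Path, xpath: str):
    """Look the value up in the metrics file itself (xpath simplified to one key)."""
    return json.loads(path.read_text())[xpath]


with tempfile.TemporaryDirectory() as tmp:
    metrics_file = Path(tmp) / "metrics.json"
    metrics_file.write_text(json.dumps({"auc": 0.91}))  # made-up value

    record = describe_metric(metrics_file, "auc")
    value = read_metric(metrics_file, "auc")

print(sorted(record))  # only path, xpath and md5 are recorded
print(value)           # the value comes from the metrics file
```

This is why the metrics file needs to be available (for example, committed to git) for anyone to see the actual numbers.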