I am currently evaluating tools for keeping track of data in our workflow, and I am looking at how DVC works. First off, thanks for developing it, Dmitry, as it is a very nice tool.
I have read the tutorial on your blog, and I have a couple of questions which I suspect are quite critical.
First: the use of hard links is a great idea when it comes to speed, but what happens if one changes the data file? I tried, and dvc status reports “corrupted cache data”, because the new hash differs from the saved one. Apparently, there is no way of going back. Is dvc assuming that data are immutable? That might hold in some use cases, but not in general. Also, mistakes happen.
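To make the first question concrete, here is a minimal Python sketch of what I think is going on. It is my assumption about the mechanism, not DVC's actual code: the file names, the use of SHA-256, and the helper names (file_hash, hardlink_demo) are all hypothetical. It only shows that editing a file in place through one hard link also changes the content seen through the other, so a previously saved hash no longer matches.

```python
import hashlib
import os
import tempfile


def file_hash(path):
    """Return the SHA-256 hex digest of a file's contents."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def hardlink_demo():
    """Edit a workspace file that is hard-linked to a stand-in 'cache' file.

    Returns (same_inode, cache_changed).
    """
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical names: a cache entry and the workspace file linked to it.
        cache_file = os.path.join(tmp, "cache_copy")
        workspace_file = os.path.join(tmp, "data.csv")

        with open(cache_file, "w") as f:
            f.write("a,b\n1,2\n")
        saved_hash = file_hash(cache_file)

        # Both names now refer to the same underlying file (POSIX hard link).
        os.link(cache_file, workspace_file)

        # Append in place through the workspace name.
        with open(workspace_file, "a") as f:
            f.write("3,4\n")

        same_inode = os.stat(cache_file).st_ino == os.stat(workspace_file).st_ino
        cache_changed = file_hash(cache_file) != saved_hash
        return same_inode, cache_changed
```

If this is roughly what happens, the original content is gone from both places after the edit, which would explain why I see no way of going back.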
Second, the run functionality is extremely useful, but it appears to be just as fragile. Apparently, dvc is not tied to git when it comes to tracking which files are used for a specific run: I have to declare each source code dependency with the -d flag to make runs reproducible. What happens if I do not declare a file that is actually being used? What happens if that file changes and a new run is executed? As far as I can tell, dvc does not check for differences, and no message or warning is raised. repro can be forced, and the results change, but no warning is raised even in this case.
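To illustrate the second concern, here is a small Python sketch of the behaviour as I understand it. Again, this is an assumption on my part and not DVC's implementation: the file names (train.py, helper.py) and the functions (snapshot, is_stale) are made up. The point is that a check based only on declared dependencies cannot notice a change in a file that was never declared.

```python
import hashlib


def digest(contents):
    """Hash a file's contents (given here as a string)."""
    return hashlib.sha256(contents.encode()).hexdigest()


def snapshot(files, declared):
    """Record hashes only for the declared dependencies."""
    return {name: digest(files[name]) for name in declared}


def is_stale(files, declared, saved):
    """A run counts as stale only if a declared dependency changed."""
    return snapshot(files, declared) != saved


def undeclared_dependency_demo():
    # Hypothetical project: train.py imports helper.py.
    files = {"train.py": "import helper", "helper.py": "RATE = 0.1"}
    declared = ["train.py"]  # helper.py is used but was never declared
    saved = snapshot(files, declared)

    # Changing the undeclared file goes unnoticed.
    files["helper.py"] = "RATE = 0.5"
    stale_after_helper_edit = is_stale(files, declared, saved)

    # Changing a declared file is detected.
    files["train.py"] = "import helper  # edited"
    stale_after_train_edit = is_stale(files, declared, saved)

    return stale_after_helper_edit, stale_after_train_edit
```

In this sketch the edit to helper.py changes what the run would produce, yet the staleness check stays silent, which matches what I observed.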
So, I am interested in knowing whether I am using the tool in an inappropriate way: am I missing something in the DVC philosophy?