Tracking data and code dependencies

ale.re · May 16, 2018, 10:31am

Hello!
I am currently evaluating tools for keeping track of data in our workflow, and I am looking at how DVC works. First off, thanks for developing it, Dmitry, as it is a very nice tool.

I have read the tutorial on your blog and I have a couple of questions, though, which I suspect are quite critical.

First: the use of hard links is a great idea when it comes to speed, but what happens if one changes the data file? I tried, and dvc status reports “corrupted cache data”, as the new hash is different from the saved one. Apparently there is no way of going back. Is dvc assuming that data are immutable? That might hold in some use cases, but not in general. Also, mistakes might happen.

Second, the run functionality is extremely useful, but it appears to be just as fragile. Apparently, dvc is not tied to git when it comes to tracking which files are used for a specific run: I have to specify the dependency on source code using the -d flag to make runs reproducible. What happens if I do not specify a file that is actually being used? What happens if that file changes and a new run is executed? Apparently, dvc does not check for differences and no message or warning is raised. repro can be forced, and results change, but no warning is raised even in this case.

So, I am interested in knowing whether I am using the tool in an inappropriate way: am I missing something in the DVC philosophy?

Thanks!
~Alessandro

kupruser · May 16, 2018, 11:10am

Hi Alessandro!

What we really want are reflinks (essentially CoW files), but they are not available on every fs as of right now. In dvc 0.9.7 (which is going to be released this week) dvc tries all possibilities: reflink/hardlink/symlink/copy until it finds something that works. Yes, when hardlinks (or symlinks) are used, dvc assumes that you won’t modify data by hand without re-adding it, so if it detects that a linked cache file has suddenly changed, it just removes it and tells the user that a corrupted cache was detected. Unfortunately there is no efficient way around it, except using reflinks, which are fortunately coming to every modern fs, so we will be able to use them exclusively in the future.
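To make the hardlink problem concrete, here is a minimal Python sketch. It is not DVC's actual code: the file names and the MD5 check are assumptions chosen for illustration. It shows that a workspace file hard-linked to a cache entry shares the same data, so an in-place edit silently changes the cache entry and its checksum no longer matches the saved one.

```python
import hashlib
import os
import tempfile


def md5_of(path):
    """Return the hex MD5 digest of a file's contents."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def demo_hardlink_corruption():
    """Edit a hard-linked workspace file and compare cache checksums."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical paths; a real cache layout will differ.
        cache_file = os.path.join(tmp, "cache_entry")
        workspace_file = os.path.join(tmp, "data.csv")

        with open(cache_file, "w") as f:
            f.write("a,b\n1,2\n")
        saved_checksum = md5_of(cache_file)

        # Both names now refer to the same underlying file data.
        os.link(cache_file, workspace_file)

        # An in-place edit through the workspace name...
        with open(workspace_file, "a") as f:
            f.write("3,4\n")

        # ...is visible through the cache name as well.
        return saved_checksum, md5_of(cache_file)


if __name__ == "__main__":
    saved, current = demo_hardlink_corruption()
    print("cache still matches saved checksum:", saved == current)
```

With a plain copy the two files would be independent, and with a reflink the filesystem would copy the data only when one side is written, which is why those link types do not have this problem.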

We previously had an auto-magical system in place that tracked new files after the run and automatically added them as outputs to the dvc file. Unfortunately, that system proved to be too implicit, which often resulted in unexpected behavior, as only the user truly knows which files they consider outputs and which they consider dependencies. Thus we implemented the current explicit system, so that users know for sure which files are dependencies and which are outputs for their stage.

If you don’t specify any dependencies, dvc will always treat your stage as changed and will always reproduce it. If you don’t specify a particular dependency, it will be hidden from dvc, so dvc will not be able to track it or reproduce your pipeline when that dependency changes. Dvc can’t know which files you want to track (or be warned about) if you don’t specify them, so the ‘repro’ and especially ‘repro --force’ behavior is totally normal.
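The rule described above can be sketched in a few lines of Python. This is an illustration only, not DVC's implementation: the function names, the MD5 checksums, and the train.py/utils.py files are all hypothetical. A stage records checksums for its declared dependencies; an edit to an undeclared file is invisible, and a stage with no dependencies always counts as changed.

```python
import hashlib
import os
import tempfile


def file_md5(path):
    """Return the hex MD5 digest of a file's contents."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def snapshot(declared_deps):
    """Record a checksum for every explicitly declared dependency."""
    return {path: file_md5(path) for path in declared_deps}


def stage_changed(recorded):
    """Decide whether a stage needs to be reproduced.

    No declared dependencies: always treated as changed.
    Otherwise: changed only if a declared dependency's checksum differs.
    """
    if not recorded:
        return True
    return any(file_md5(p) != checksum for p, checksum in recorded.items())


def demo_hidden_dependency():
    """Return (hidden edit detected, declared edit detected)."""
    with tempfile.TemporaryDirectory() as tmp:
        train = os.path.join(tmp, "train.py")  # declared dependency
        utils = os.path.join(tmp, "utils.py")  # used, but never declared
        for path in (train, utils):
            with open(path, "w") as f:
                f.write("# version 1\n")

        recorded = snapshot([train])

        with open(utils, "a") as f:
            f.write("# version 2\n")
        hidden_edit_detected = stage_changed(recorded)

        with open(train, "a") as f:
            f.write("# version 2\n")
        declared_edit_detected = stage_changed(recorded)

        return hidden_edit_detected, declared_edit_detected


if __name__ == "__main__":
    print(demo_hidden_dependency())
```

In this sketch the edit to utils.py goes unnoticed because the file was never declared, which mirrors the "no warning" behavior asked about in the original question.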

It definitely seems like dvc is an appropriate tool for your project. Maybe you could elaborate on your scenario, so we could suggest you some tips and tricks?

Cheers,
Ruslan

ale.re · May 16, 2018, 11:53am

Hello Ruslan,
thank you for your wonderful answer!

First, it’s wonderful to know that reflinks are coming, but I wonder if our platforms are supported. I searched the GitHub issue tracker to read about it (Issue #280), but it is not clear to me whether copy can be used instead (by default). I’d rather trade speed for safety on this, so it would be great if DVC could be configured to prefer reflink over copy over hard/softlink.

Regarding the auto-magical system to track new files: yes, I read about that somewhere and I agree the current system is better. Explicit is better than implicit.

On this, I must admit that, in some ways, I prefer the approach that Pachyderm is using: having a virtual filesystem where data is exposed, and applications are free to read/write from there to have automatic tracking of outputs. With DVC, docker is not necessary (as it is in Pachyderm) so a thin virtual filesystem layer would be sufficient, but I don’t know if this is portable (FUSE on Linux, BSD and Mac, but I don’t know what’s available on Windows).
I think that would be a nice alternative to how DVC is currently operating, but I understand it might take too much dev effort.

Regarding the tracking: thanks for letting me know. I’ll grok this, then ask again if something is still unclear.

Regarding our use case: it’s very similar to this, but it’s kind of complex as we have some requirements regarding authentication and data access. I think it’s better to start a new conversation, but I’ll take some time to learn more about DVC before doing that.

Thanks!
~Alessandro

kupruser · May 16, 2018, 12:34pm

And thank you for the feedback, we really appreciate it!

Your safety concerns are totally understandable, and starting from 0.9.7 you will be able to select the default type of links in the config. E.g. to select ‘copy’ you would need to run dvc config cache.type copy. Unfortunately, as of 0.9.7, you will not be able to alter the order of preferred types, but I totally agree that it is indeed a great idea and we will definitely consider adding support for it in 0.9.8 (https://github.com/dataversioncontrol/dvc/issues/709).
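As a rough illustration of what a configurable order of link types could look like, here is a hypothetical Python sketch. It is not DVC's code and says nothing about how issue 709 would be implemented: the `link_from_cache` function and its `preferred` parameter are invented for this example, and reflinks are left out because the Python standard library has no portable call for them.

```python
import os
import shutil
import tempfile

# Link strategies available in the standard library (reflink omitted).
LINKERS = {
    "hardlink": os.link,
    "symlink": os.symlink,
    "copy": shutil.copyfile,
}


def link_from_cache(cache_path, workspace_path,
                    preferred=("hardlink", "symlink", "copy")):
    """Try each link type in the preferred order; return the one that worked."""
    for name in preferred:
        try:
            LINKERS[name](cache_path, workspace_path)
            return name
        except OSError:
            # This link type is not usable here; fall through to the next.
            continue
    raise OSError("none of the preferred link types worked")


def demo_copy_is_isolated():
    """With 'copy', editing the workspace file leaves the cache untouched."""
    with tempfile.TemporaryDirectory() as tmp:
        cache_path = os.path.join(tmp, "cache_entry")
        workspace_path = os.path.join(tmp, "data.csv")
        with open(cache_path, "w") as f:
            f.write("original\n")

        used = link_from_cache(cache_path, workspace_path, preferred=("copy",))

        with open(workspace_path, "a") as f:
            f.write("edited\n")
        with open(cache_path) as f:
            return used, f.read()


if __name__ == "__main__":
    print(demo_copy_is_isolated())
```

Passing `preferred=("copy", "hardlink")` would express the "safety over speed" preference from the earlier post: the slower but isolated option is tried first, and links are only a fallback.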

We have actually considered adding docker and/or FUSE support to provide another level of isolation and reproducibility, but just didn’t have time to get to it. But we will definitely look into it in the future!

Sure, feel free to ping us, we are happy to help.

Cheers,
Ruslan

dmitry · May 18, 2018, 12:11am

Alessandro, thank you for the questions! Ruslan, thanks for the answers!

You are right about immutable data files. The new copy semantics can solve the issue at some cost in performance (you will feel that with files of 5 GB or more). And, as Ruslan described, reflink is a way to solve the issue without the performance impact.

This feature was just released: DVC 0.9.7 release

Thanks,
Dmitry
