Handling indeterministic output


#1

Hi,

First of all, very nice tool, thank you for that!

I am currently experimenting with dvc in a pytorch ML pipeline.
One step of the pipeline is training and pickling a model.

Usually training a model starts from random numbers, so if I do not set random seeds I end up with a different model on each training re-run, resulting in a new md5 hash for the model file each time.
Even when I set seeds and the model ends up with the same values, I have not yet managed to pickle a PyTorch model in a way that exactly matches the previous run in terms of its md5 hash.
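For what it's worth, the seed-to-hash relationship can be sketched with plain stdlib Python (the `train` and `model_md5` helpers below are toy stand-ins for real PyTorch code, not anything from dvc):

```python
import hashlib
import pickle
import random

def train(seed):
    """Toy 'training': derive model weights from a seeded RNG."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(10)]

def model_md5(model):
    """md5 of the pickled model, i.e. what dvc would see for the output file."""
    return hashlib.md5(pickle.dumps(model)).hexdigest()

# Two runs with the same seed serialize to identical bytes and hashes...
assert model_md5(train(42)) == model_md5(train(42))
# ...while different seeds give different models and different hashes.
assert model_md5(train(42)) != model_md5(train(7))
```

With real PyTorch you would additionally need `torch.manual_seed` (and care around CUDA non-determinism), and the saved file may still embed run-specific metadata, which is exactly the byte-level mismatch described above.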

Do you recommend any best practices regarding this kind of non-deterministic/random output (for PyTorch and/or in general)?
I suspect problems might arise when sharing pipelines but not data, or when the metrics output differs between runs.
Is it a problem at all? Is dvc capable of handling these things and if yes, how?

Thanks,

Alex


#2

Hi @arimale !

Thank you for a great question!

Do you recommend any best practices regarding this kind of non-deterministic/random output (for PyTorch and/or in general)?

One thing that might be useful in such cases is the ability to lock/unlock a pipeline branch from being reproduced; see our docs on the lock/unlock commands. This is handy when you don’t want the non-deterministic part of your pipeline to be run again and instead only want to use its end result, e.g. the model that you’ve created.

I suspect problems might arise when sharing pipelines but not data, or when the metrics output differs between runs.
Is it a problem at all? Is dvc capable of handling these things and if yes, how?

Dvc can work without any cached data at all: as long as you provide the input files for your pipeline (e.g. cp /path/to/input/data data in your project’s root), it can be fully reproduced. As for metrics, it seems totally normal that they will not match exactly in non-deterministic scenarios, and that doesn’t create any problems for dvc itself, since it doesn’t care about their values at all. Maybe I didn’t quite get the question; could you please elaborate on the scenarios that concern you?

Thanks,
Ruslan


#3

Thanks for the great question!

I don’t see this as a problem in general. Yes, as far as I know, even with a hard-coded seed (and I would recommend making the seed a train-stage argument and passing it via the command line for better reproducibility) it’s almost impossible to get the exact same result, i.e. the same model and the same metrics, in a different environment. DVC can capture the state you have at the moment and help you get the same state (including metrics and models) on a different machine. It doesn’t mean that if you force it to run the pipeline again it will end up with the exact same model.
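A minimal sketch of the "seed as a train-stage argument" idea (the file name train.py and the --seed flag are just assumptions for illustration):

```python
import argparse
import random

def parse_args(argv=None):
    # The seed becomes part of the stage's command line, so it is
    # recorded in the dvc file along with the rest of the command.
    parser = argparse.ArgumentParser(description="train stage")
    parser.add_argument("--seed", type=int, default=42, help="random seed")
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    random.seed(args.seed)  # with PyTorch you would also call torch.manual_seed(args.seed)
    # ... actual training would go here ...
    return args.seed

if __name__ == "__main__":
    main()
```

The stage could then be created with something like `dvc run -d data -o model.pkl python train.py --seed 42`, so changing the seed changes the recorded command and dvc knows the stage needs re-running.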

@arimale could you give us a bit more detail? What does your use case look like, and why do you see this as a problem/concern (if you do)?


#4

Thanks for your replies! Helped me understand it better!

I was just wondering how deterministic my pipeline needs to be, since the dvc files, as far as I understand, rely heavily on md5 hashes. On the other hand, especially regarding metrics, I was thinking about total reproducibility in the sense of exact same results, because that would ensure 100% that I did not make a mistake in reproducing, and that I can safely build on top of the current state. I totally understand this is naturally a hard goal, especially across different environments; nonetheless I thought it was worth asking for your thoughts. Thanks for the seed-as-argument idea, I think that should be sufficient to get the same metrics, even with different md5 sums in intermediate files.

As a more concrete scenario, I was wondering what happens if I share only the basic input data files, but no intermediate files. A collaborator will then re-run the pipeline and, by doing so, end up with different md5 hashes in their local dvc files, probably differing from those checked in with git. I guess this is not a problem, but I wasn’t sure.
As I haven’t tried remote dvc repositories yet, I was also wondering if different md5 hashes might somehow lead to conflicts when pushing or pulling.

I think I got the metrics part: the metrics themselves are not stored in the dvc file, only the file name and xpath, so every collaborator might see different metrics in the end (if the pipeline is non-deterministic). Correct?
I had first thought that the metric values themselves were stored in the dvc file and thus checked in with git…

Thanks,

Alex


#5

Hi @arimale !

As a more concrete scenario, I was wondering what happens if I share only the basic input data files, but no intermediate files. A collaborator will then re-run the pipeline and, by doing so, end up with different md5 hashes in their local dvc files, probably differing from those checked in with git. I guess this is not a problem, but I wasn’t sure.

Yes, that is totally not a problem for dvc. The fact that the md5s differ will only make dvc say that the pipeline has changed, since some dependencies/outputs have different md5s, but your pipeline will still be in working order; it will just produce slightly different results simply because of the non-deterministic scenario.

As I haven’t tried remote dvc repositories yet, I was also wondering if different md5 hashes might somehow lead to conflicts when pushing or pulling.

Not really. If your colleague runs dvc repro and gets different results, they will be able to safely push their data even to the same remote that you’ve used. Same when pulling: dvc will try to pull data that matches the current md5s in the dvc files (which are updated when you run dvc repro), so it will not result in any conflicts.

I think I got the metrics part: the metrics themselves are not stored in the dvc file, only the file name and xpath, so every collaborator might see different metrics in the end (if the pipeline is non-deterministic). Correct? I had first thought that the metric values themselves were stored in the dvc file and thus checked in with git…

Correct. Dvc only remembers the path, xpath, and md5 checksum of the metrics file in the dvc file itself. The metrics file itself is currently recommended to be stored in git (i.e. the -M option of dvc run doesn’t add the metrics file to the dvc cache, leaving it for the user to commit to git), so that it is available even if you don’t have any data alongside the project (e.g. when viewing it on github).
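For illustration, a dvc stage file along those lines might look roughly like this (file names, field layout, and checksums are placeholders for the sake of the sketch, not copied from a real project):

```yaml
cmd: python evaluate.py
deps:
- md5: <md5 of model.pkl>
  path: model.pkl
outs:
- md5: <md5 of metrics.json>
  path: metrics.json
  cache: false          # created with -M, so committed to git, not the dvc cache
  metric:
    type: json
    xpath: accuracy
```

So the dvc file records where the metrics live and how to read them, while the actual values live in metrics.json under git.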

Thanks,
Ruslan


#6

Thank you again, this was really helpful!