DVC-hash and PyTorch files

pbardsle · June 6, 2025, 5:47pm

Hello,

I am looking into building a CI/CD job for a project that runs a (tiny) PyTorch model training end-to-end. This job should let us know if a code/data change has modified anything in the training pipeline. My question comes down to the hash (md5 IIRC) that DVC computes on saved files: it appears PyTorch-saved files store some form of metadata that changes between runs (e.g., timestamps, filepaths). I’m having a hard time finding any resources on DVC compatibility with files generated by torch.save(). Can someone point me in the right direction?

I’m guessing someone here has encountered this before, and I’m curious how it was solved. If possible, I’d rather not write a custom weights/model writer to preserve file hashes between runs.

Best,
Patrick

skshetry · June 10, 2025, 10:10am

Unfortunately, this is not possible. You will have to figure out how to generate reproducible artifacts.

Reproducibility — PyTorch 2.7 documentation

Alternatively, you could generate an additional output/artifact that changes only when the code/data is actually updated.
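One way to picture such an additional artifact is a small digest file computed from the weight values themselves rather than from the bytes that torch.save() writes. The sketch below is only an illustration: the names `weights_digest`, `write_digest`, and `weights.sha256` are hypothetical, and it assumes the values expose `.tolist()` (as `torch.Tensor` does) or are already plain Python lists. It also assumes the training run itself is deterministic (fixed seeds and so on); otherwise the digest will still differ between runs.

```python
import hashlib
import json
from pathlib import Path


def weights_digest(state):
    """Return a SHA-256 hex digest of a mapping of parameter names to values.

    Values may be anything exposing .tolist() (torch.Tensor does) or plain
    Python numbers and lists. Only names and values are hashed, so metadata
    stored in the checkpoint file does not affect the result.
    """
    plain = {
        name: value.tolist() if hasattr(value, "tolist") else value
        for name, value in state.items()
    }
    # sort_keys and fixed separators give one canonical serialization.
    payload = json.dumps(plain, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def write_digest(state, path="weights.sha256"):
    """Write the digest to a small text file (hypothetical file name)."""
    digest = weights_digest(state)
    Path(path).write_bytes((digest + "\n").encode("utf-8"))
    return digest
```

With PyTorch this might be called as `write_digest(model.state_dict())`, with the digest file listed as the pipeline output that the CI job compares, while the checkpoint itself is left out of the comparison.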

pbardsle · June 10, 2025, 1:29pm

Thank you for the response and the info — I was slowly coming to that realization as well so it is nice to have the confirmation. JSONifying the state-dict will probably be the way I move forward and should be easy enough for our purposes.

Thanks again!
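For readers landing here later, a minimal sketch of what JSONifying a state dict could look like follows. It is not the poster's actual code: `state_to_json` and `save_state_json` are hypothetical helper names, and the sketch assumes every value in the state dict exposes `.tolist()` (as `torch.Tensor` does) or is already JSON-serializable. Sorting the keys and writing bytes directly is meant to keep the file identical across runs whenever the weights are identical.

```python
import json
from pathlib import Path


def state_to_json(state, indent=None):
    """Serialize a mapping of parameter names to values as canonical JSON."""
    plain = {
        name: value.tolist() if hasattr(value, "tolist") else value
        for name, value in state.items()
    }
    # sort_keys makes the output independent of dict insertion order.
    return json.dumps(plain, sort_keys=True, indent=indent)


def save_state_json(state, path):
    """Write the JSON text as UTF-8 bytes, avoiding platform newline changes."""
    text = state_to_json(state, indent=2) + "\n"
    Path(path).write_bytes(text.encode("utf-8"))
    return text
```

With PyTorch this might be called as `save_state_json(model.state_dict(), "weights.json")`. A JSON file is much larger than a binary checkpoint, so this probably only suits tiny models like the one described above, and the file hash will still change if the training run itself is not deterministic.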
