DVC-hash and PyTorch files

Hello,

I am looking into building a CI/CD job for a project that runs a (tiny) PyTorch model training end-to-end. The job should tell us whether a code or data change has modified anything in the training pipeline. My question comes down to the hash (md5, IIRC) that DVC computes on saved files: PyTorch-saved files appear to store some form of metadata that changes between runs (e.g., timestamps, file paths), so the hash changes even when the weights do not. I’m having a hard time finding any resources on DVC compatibility with files generated by torch.save(). Can someone point me in the right direction?

I’m guessing someone here has encountered this before, and I’m curious how it was solved. If possible, I’d rather not write a custom weights/model writer to preserve file hashes between runs.

Best,
Patrick

Unfortunately, this is not possible on the DVC side: it hashes the file contents as they are. You will have to figure out how to generate reproducible artifacts.

Alternatively, you could generate an additional output/artifact that changes only when the code/data is actually updated.

Thank you for the response and the info. I was slowly coming to that realization as well, so it is nice to have the confirmation. JSONifying the state dict will probably be the way I move forward, and it should be easy enough for our purposes.
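
For anyone landing here later, a minimal sketch of the JSON approach, under a few assumptions: the helper names (to_plain, state_dict_to_json, md5_of) are made up for illustration, the model is small enough that a text dump is practical, and the training run itself is deterministic (fixed seeds, same hardware), since otherwise the weights differ regardless of the file format. It relies only on tensors exposing .tolist(), which torch.Tensor does, so the block itself does not import torch.

```python
import hashlib
import json


def to_plain(value):
    """Convert a tensor-like value to plain Python data.

    torch.Tensor (and numpy.ndarray) expose .tolist(); anything else,
    such as a plain number or list, is passed through unchanged.
    """
    if hasattr(value, "tolist"):
        return value.tolist()
    return value


def state_dict_to_json(state_dict):
    """Serialize a state dict to JSON text with a stable key order.

    sort_keys=True keeps the output independent of insertion order, and
    the text contains only parameter names and values, no timestamps or
    file paths.
    """
    plain = {name: to_plain(value) for name, value in state_dict.items()}
    return json.dumps(plain, sort_keys=True, indent=1)


def md5_of(text):
    """Return the md5 hex digest of the UTF-8 encoded text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Stand-in for model.state_dict(); with PyTorch the values would be
    # tensors and to_plain() would call .tolist() on each of them.
    example = {"layer.weight": [[1.0, 2.0]], "layer.bias": [0.5]}
    text = state_dict_to_json(example)
    with open("weights.json", "w", encoding="utf-8") as f:
        f.write(text)
    print(md5_of(text))
```

With a real model this would be called as state_dict_to_json(model.state_dict()), and weights.json (a hypothetical file name) would be the output DVC tracks instead of, or next to, the torch.save() file. Identical weights then give an identical file; tiny floating-point differences between runs will still change the hash.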

Thanks again!