DVC Cache causing problems when a pull is corrupted

For background: we pull binary build files from an S3 bucket. When one of our machines crashes mid-pull, DVC does not pull the files again on the next attempt. This is because the DVC cache tells DVC everything is up to date, even though the files are corrupted/incomplete.

My intuitive thought would be that the DVC cache would get updated once the pull is confirmed successful. Is there a reason DVC updates the cache prior to completing a pull?
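
For illustration, the ordering described above (only publish a file once the transfer has finished) is commonly implemented by writing to a temporary file and renaming it at the end. The sketch below is generic Python and is not a claim about how DVC handles transfers internally; `atomic_write` and its parameters are hypothetical names chosen for this example.

```python
import os
import tempfile


def atomic_write(dest_path, chunks):
    """Write chunks to dest_path so a crash never leaves a partial file under that name."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Keep the temporary file in the destination directory so the final
    # rename stays on a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
            tmp.flush()
            os.fsync(tmp.fileno())
        # The final name only appears once all data has been written.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # On any failure, remove the partial file and re-raise.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this pattern, an interrupted transfer leaves at most a stray `.part` file behind, and the destination name either does not exist or holds complete data.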

“even though the files are corrupted/incomplete.”

@delano could you please clarify? How can you tell that these files are corrupted?

What version of DVC do you use? Could you share some details, such as the output of dvc version?

I use 2.47.0 and a more accurate statement would probably be that the files are incomplete. When I try using the program that uses these files it reports them as corrupted. I don’t believe they’re actually corrupted, just missing data.

Sorry if I’m missing information about the situation. Please feel free to ask for more, and I’ll respond with whatever I can.

Thanks for the details. Could you please share the complete output of dvc version? Also, do you use any special DVC config options (in the .dvc/config and .dvc/config.local files)? What I’m trying to understand is whether the file is corrupted in the DVC cache, in the workspace, or both.
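
One rough way to check the cache side of that question is to re-hash the cached objects and compare each hash with the file's name. The following is a minimal sketch, not an official DVC tool. It assumes the cache stores each object under a two-character directory plus the remaining 30 characters of its MD5 (for example .dvc/cache/ab/cdef...), and it skips .dir entries. The function names are hypothetical. As far as I know, DVC 2.x normalizes line endings before hashing text files, so text files with CRLF endings may be reported as false positives; binary files should not be affected.

```python
import hashlib
import os


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatched_cache_files(cache_dir):
    """Return cache files whose content hash differs from their path-derived name."""
    mismatched = []
    for root, _dirs, files in os.walk(cache_dir):
        prefix = os.path.basename(root)
        if len(prefix) != 2:
            continue  # only inspect two-character prefix directories
        for name in files:
            if name.endswith(".dir"):
                continue  # directory listings are not checked in this sketch
            expected = prefix + name
            if len(expected) != 32:
                continue  # not named like an MD5-addressed object
            path = os.path.join(root, name)
            if file_md5(path) != expected:
                mismatched.append(path)
    return sorted(mismatched)
```

Calling `find_mismatched_cache_files(".dvc/cache")` from the repository root would list cached files whose contents no longer match their names, which is what a truncated download would look like under the assumptions above.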

I don’t have the exact output, since this only happened when my computer froze midway through a dvc pull. From what I remember, the error message suggested updating the DVC cache but didn’t give a command to do so, only a link, I think. I’ve since solved the problem by running dvc destroy -f and restoring the git-tracked files it removed using git stash -u. I was then able to pull again, and all my binary files were there. Before running dvc destroy -f; git stash -u, it wouldn’t try to pull anything (which is why I assumed the cache was preventing it).

The only removed files I can think of that would have solved my problem are the cache files.

I am using a .dvc/config; I don’t seem to be using a .dvc/config.local.

Hey, thanks! I meant the complete dvc version or dvc doctor output. That would be helpful, thanks!

By the way, before the destroy, did you try running git stash -u and then dvc pull?