Some background: we pull binary build files from an S3 bucket. When one of our machines crashes mid-pull, DVC will not pull the files on the next attempt. This is because the DVC cache tells DVC it’s up to date, even though the files are corrupted or incomplete.
My intuition is that the DVC cache would only be updated once the pull is confirmed successful. Is there a reason DVC updates the cache before a pull has completed?
I use DVC 2.47.0. A more accurate statement would probably be that the files are incomplete. When I try to use the program that depends on these files, it reports them as corrupted. I don’t believe they’re actually corrupted, just missing data.
Sorry if I’m missing information about the situation. Please feel free to ask for more and I’ll provide whatever I can.
Thanks for the details. Could you please share the complete output of dvc version? Also, do you use any special DVC config options (in the .dvc/config and .dvc/config.local files)? What I’m trying to understand is whether the file is corrupted in the DVC cache, in the workspace, or in both.
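One way to check the cache side is to compare each cached file's content hash against the hash encoded in its path. The script below is only a rough sketch, not an official DVC tool. It assumes the DVC 2.x layout where objects live at .dvc/cache/<first 2 hex characters>/<remaining 30 hex characters> and are named after the MD5 of their content, with directory listings carrying a .dir suffix. The function names are made up for illustration. It hashes raw bytes, so text files with Windows line endings may be reported as false positives if DVC normalized them before hashing; for binary build files that should not matter.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file's raw bytes."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatched_cache_files(cache_dir):
    """Return cache files whose content hash differs from the hash in their path.

    Assumes the layout <cache_dir>/<2 hex chars>/<30 hex chars>[.dir].
    """
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []

    mismatched = []
    for prefix_dir in sorted(cache_dir.iterdir()):
        # Only look inside the two-character prefix directories.
        if not prefix_dir.is_dir() or len(prefix_dir.name) != 2:
            continue
        for obj in sorted(prefix_dir.iterdir()):
            if not obj.is_file():
                continue
            name = obj.name
            # Directory listings are assumed to be stored as "<hash>.dir".
            if name.endswith(".dir"):
                name = name[: -len(".dir")]
            expected = prefix_dir.name + name
            if file_md5(obj) != expected:
                mismatched.append(obj)
    return mismatched


if __name__ == "__main__":
    # Run from the repository root; adjust the path if the cache lives elsewhere.
    for path in find_mismatched_cache_files(".dvc/cache"):
        print("hash mismatch:", path)
```

If this prints any paths, the incomplete data is in the cache itself. If it prints nothing but the workspace files are still broken, the problem is more likely on the workspace side.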
I don’t have the exact output, since this only happened when my computer froze midway through a dvc pull. From what I remember, the error message suggested updating the DVC cache but didn’t give a command to do so, only a link, I think. I’ve since solved the problem by running dvc destroy -f and replacing the Git-tracked files it removed using git stash -u. I was then able to pull again, and all my binary files were there. Before running dvc destroy -f; git stash -u, DVC wouldn’t try to pull anything (which is why I assumed the cache was preventing this).
The only removed files I can think of that would have solved my problem are the cache files.
I am using a .dvc/config, but I don’t seem to be using a .dvc/config.local.