How to identify data corruption?

Hi!
So I've been using DVC with an S3 remote, and somehow a file ended up only partially uploaded to S3. When I pulled it, though, DVC didn't complain. I noticed the file was shorter than it should have been and its md5sum didn't match the md5 recorded in the .dvc file.
I tried things like dvc status and dvc status -c, but neither of them complained.

Even re-adding the file didn't fix the issue. In the end I manually deleted the file on S3 and re-added it; that worked, but it was not a good experience. I also only noticed the problem because my own code complained about the file.

I was shocked that DVC apparently didn't calculate the checksum and verify that it matches the one in the .dvc file. Is it supposed to work like this?
Is there a way to check all my files for corruption?

Hi, you can run dvc remote modify myremote verify true so that DVC recalculates and checks the hash of each file it downloads. See the remote modify docs.
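
For anyone finding this later, here is a minimal sketch of what that looks like; the remote name myremote is a placeholder for your own setup:

```
# Minimal sketch: "myremote" is a placeholder for your own remote name.
dvc remote modify myremote verify true   # recompute file hashes on every download
dvc pull                                 # a corrupted download now fails the hash check
```

This writes verify = true under the ['remote "myremote"'] section of .dvc/config, so the setting can be committed and shared with the rest of the team.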


@tfriedel
Normally, DVC tracks your data by the hashes stored in .dvc files and relies on them during operations like dvc pull and dvc status -c. To check for file corruption, you can compare each file in the workspace against the hash recorded in its .dvc file (see the sketch below). Reviewing your DVC configuration may also help identify what went wrong. Beyond that, the DVC community channels are a good place to ask about keeping data consistent between DVC and S3.
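
A rough way to do that across the whole repo is sketched below. It is only a sketch: it assumes every .dvc file tracks a single file (not a directory) with an md5 hash, and that md5sum is available on your system.

```
# Rough sketch: compare each tracked file's md5 with the hash recorded in its .dvc file.
# Assumes single-file outputs with md5 hashes; directory outputs (.dir hashes) are not handled.
find . -type f -name '*.dvc' | while read -r dvcfile; do
  dir=$(dirname "$dvcfile")
  path=$(awk '/^ *path:/ {print $2; exit}' "$dvcfile")
  want=$(awk '/md5:/ {print $NF; exit}' "$dvcfile")
  have=$(md5sum "$dir/$path" | awk '{print $1}')
  [ "$want" = "$have" ] || echo "MISMATCH: $dir/$path (expected $want, got $have)"
done
```

Anything it prints is a file whose content no longer matches what DVC recorded when the file was added.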

Hello,
As noted above, DVC only re-verifies checksums when pulling from S3 if the remote's verify option is enabled. If you notice issues like partial uploads or checksum mismatches without any warnings, turn verify on for the remote and use dvc status -c to see which data is out of sync between your cache and the remote.
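
For completeness, here is roughly the recovery the original poster described, as a hedged sketch. Every name in it (the remote myremote, the bucket path, the data file, the hash) is a placeholder, and the files/md5/ layout shown matches DVC 3.x caches and remotes; older versions store objects directly under the first two characters of the hash.

```
# Rough sketch of the recovery described above; all names, buckets and paths are placeholders.
MD5=0123456789abcdef0123456789abcdef                                # hash recorded in the .dvc file
aws s3 rm "s3://my-bucket/dvc-store/files/md5/${MD5:0:2}/${MD5:2}"  # remove the partial object from the remote
rm -f ".dvc/cache/files/md5/${MD5:0:2}/${MD5:2}"                    # remove any local cached copy of that object
dvc remote modify myremote verify true                              # re-verify hashes on future downloads
dvc add data/myfile.bin                                             # re-hash the known-good local copy
dvc push                                                            # upload a complete copy to the remote
```

This assumes the copy in your workspace is the good one; if it is not, restore it from wherever it originally came from before re-adding.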