DVC removes files after pull

I have a problem pulling files defined in a pipeline.

Running the fetch command:

$ dvc fetch dvc.yaml:correct_h5@5 -v 
2024-08-14 15:32:18,325 DEBUG: v3.53.1 (pip), CPython 3.10.12 on Linux-5.15.0-71-generic-x86_64-with-glibc2.35
2024-08-14 15:32:18,325 DEBUG: command: /home/ermolaev/projects/radml/venv/bin/dvc fetch dvc.yaml:correct_h5@5 -v
2024-08-14 15:32:18,898 DEBUG: Lockfile '../02_seg/dvc.lock' needs to be updated.
Collecting                                                                                                                        |16.0 [00:00, 34.2entry/s]
2024-08-14 15:32:23,847 DEBUG: Preparing to transfer data from 'gdrive://1GkI-Tgwdi1r2oat1E0wxFG3sXc6oL2UL/files/md5' to '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5'
2024-08-14 15:32:23,847 DEBUG: Preparing to collect status from '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5'
2024-08-14 15:32:23,848 DEBUG: Collecting status from '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5'
2024-08-14 15:32:23,850 DEBUG: Preparing to collect status from '1GkI-Tgwdi1r2oat1E0wxFG3sXc6oL2UL/files/md5'
2024-08-14 15:32:23,850 DEBUG: Collecting status from '1GkI-Tgwdi1r2oat1E0wxFG3sXc6oL2UL/files/md5'                                                         
2024-08-14 15:32:23,856 DEBUG: Querying 27 oids via object_exists
2024-08-14 15:32:31,193 DEBUG: Querying 0 oids via object_exists                                                                                            
2024-08-14 15:32:34,263 DEBUG: Querying 1 oids via object_exists                                                                                            
Computing md5 for a large file '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5/15/e440afe8bb77ed66da4ea829008023'. This is only done once.                                                                                                                                                        
2024-08-14 15:36:51,238 DEBUG: corrupted cache file '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5/15/e440afe8bb77ed66da4ea829008023'.                                                                                                                                                           
2024-08-14 15:36:51,239 DEBUG: Removing '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5/15/e440afe8bb77ed66da4ea829008023'            
2024-08-14 15:36:52,442 DEBUG: Preparing to transfer data from 'gdrive://1GkI-Tgwdi1r2oat1E0wxFG3sXc6oL2UL' to '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache'                                                                                                                                             
2024-08-14 15:36:52,442 DEBUG: Preparing to collect status from '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache'
2024-08-14 15:36:52,442 DEBUG: Collecting status from '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache'
Fetching
1 file fetched
2024-08-14 15:36:52,453 DEBUG: Analytics is enabled.
2024-08-14 15:36:52,511 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmpa4fq7y13', '-v']
2024-08-14 15:36:52,520 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmpa4fq7y13', '-v'] with pid 26890

Looking at the result:

$ ls ../../data/full_datasets/fractures_0424_seg/h5-corrected/test/ -al

we can see that the link is broken.

But monitoring the cache folder during the fetch shows that the file was downloaded and then removed:

ermolaev@e4b19654972b:~/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5/15$ du -hs ./*
18M     ./4f4b9d51f80aa62418fea1b3e224ca
1.8M    ./69abe38069fc9f99f8f995068683fd
4.0K    ./a0e72e8fdd621498eaf87c967403ac.dir
4.0K    ./d571117c7402aba36330ffed4e76fe.dir
4.0K    ./f2a2517da2527a0a9d737e9f63ece6.dir
ermolaev@e4b19654972b:~/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5/15$ du -hs ./*
18M     ./4f4b9d51f80aa62418fea1b3e224ca
1.8M    ./69abe38069fc9f99f8f995068683fd
4.0K    ./a0e72e8fdd621498eaf87c967403ac.dir
4.0K    ./d571117c7402aba36330ffed4e76fe.dir
12G     ./e440afe8bb77ed66da4ea829008023
4.0K    ./f2a2517da2527a0a9d737e9f63ece6.dir
ermolaev@e4b19654972b:~/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5/15$ du -hs ./*
18M     ./4f4b9d51f80aa62418fea1b3e224ca
1.8M    ./69abe38069fc9f99f8f995068683fd
4.0K    ./a0e72e8fdd621498eaf87c967403ac.dir
4.0K    ./d571117c7402aba36330ffed4e76fe.dir
12G     ./e440afe8bb77ed66da4ea829008023
4.0K    ./f2a2517da2527a0a9d737e9f63ece6.dir
ermolaev@e4b19654972b:~/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5/15$ du -hs ./*
18M     ./4f4b9d51f80aa62418fea1b3e224ca
1.8M    ./69abe38069fc9f99f8f995068683fd
4.0K    ./a0e72e8fdd621498eaf87c967403ac.dir
4.0K    ./d571117c7402aba36330ffed4e76fe.dir
4.0K    ./f2a2517da2527a0a9d737e9f63ece6.dir
ermolaev@e4b19654972b:~/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5/15$ du -hs ./*
18M     ./4f4b9d51f80aa62418fea1b3e224ca
1.8M    ./69abe38069fc9f99f8f995068683fd
4.0K    ./a0e72e8fdd621498eaf87c967403ac.dir
4.0K    ./d571117c7402aba36330ffed4e76fe.dir
4.0K    ./f2a2517da2527a0a9d737e9f63ece6.dir

This looks very strange. Why did DVC decide to remove an already downloaded file from the cache? Why was this file deleted while the others stayed?

The answer is in the logs:

2024-08-14 15:36:51,238 DEBUG: corrupted cache file '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5/15/e440afe8bb77ed66da4ea829008023'.                                                                                                                                                           

Looks like a corrupted file got pushed into the remote.

I see, but the file is correct. I have a connection to the server where this file is in the cache. I repeated dvc push and everything is fine. I’ve downloaded it manually from the cache and it’s fine.

Why does DVC even remove such files? What does “corrupted” mean? What are the criteria? And why didn’t DVC raise any error? If I had run the same command without the “-v” flag, I would never have known about this problem.

Moreover, running dvc pull for the same target on another machine gives correct results:

$ dvc pull dvc.yaml:correct_h5@5 -v
2024-08-14 16:27:41,309 DEBUG: v3.53.1 (pip), CPython 3.10.14 on Linux-6.8.0-31-generic-x86_64-with-glibc2.39
2024-08-14 16:27:41,309 DEBUG: command: /home/ermolaev/projects/radml/venv/bin/dvc pull dvc.yaml:correct_h5@5 -v
2024-08-14 16:27:42,229 DEBUG: Lockfile '../05_unf_dt/dvc.lock' needs to be updated.
2024-08-14 16:27:42,404 DEBUG: Lockfile for '../06_cage_sgm/dvc.yaml' not found
Collecting                                                                                                                        |16.0 [00:00, 60.6entry/s]
2024-08-14 16:27:45,287 DEBUG: Preparing to transfer data from 'gdrive://1GkI-Tgwdi1r2oat1E0wxFG3sXc6oL2UL/files/md5' to '/home/ermolaev/projects/radml/content/DVC_CACHE/files/md5'
2024-08-14 16:27:45,288 DEBUG: Preparing to collect status from '/home/ermolaev/projects/radml/content/DVC_CACHE/files/md5'
2024-08-14 16:27:45,288 DEBUG: Collecting status from '/home/ermolaev/projects/radml/content/DVC_CACHE/files/md5'
2024-08-14 16:27:45,297 DEBUG: Preparing to transfer data from 'gdrive://1GkI-Tgwdi1r2oat1E0wxFG3sXc6oL2UL' to '/home/ermolaev/projects/radml/content/DVC_CACHE/'                                                                                                                                                       
2024-08-14 16:27:45,297 DEBUG: Preparing to collect status from '/home/ermolaev/projects/radml/content/DVC_CACHE/'
2024-08-14 16:27:45,297 DEBUG: Collecting status from '/home/ermolaev/projects/radml/content/DVC_CACHE/'
Fetching
Computing md5 for a large file '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/ribs/data/full_datasets/fractures_0424_seg/h5-corrected/test/bin.h5'. This is only done once.
Building workspace index                                                                                                          |24.0 [00:17, 1.41entry/s]
Comparing indexes                                                                                                                |22.0 [00:00, 3.63kentry/s]
Applying changes                                                                                                                  |0.00 [00:00,     ?file/s]
Everything is up to date.
2024-08-14 16:28:02,370 DEBUG: Analytics is enabled.
2024-08-14 16:28:02,393 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmpf7hssrau', '-v']
2024-08-14 16:28:02,399 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmpf7hssrau', '-v'] with pid 117744
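
On the question of what "corrupted" means: in the DVC 3.x cache layout shown in the logs, an object's path encodes its MD5 (the two-character directory plus the file name, so files/md5/15/e440afe8... stands for the hash 15e440afe8...). As far as I understand it, the "corrupted cache file" message is logged when the hash recomputed from the file's contents does not match that name, and the file is then removed. The sketch below reproduces that comparison by hand so the result can be compared across machines. It is not DVC's own code; the function names are made up for illustration, and it only handles plain file objects, not the .dir entries.

```python
import hashlib
from pathlib import Path


def expected_md5_from_cache_path(path: Path) -> str:
    """Rebuild the hash encoded in a cache path: <2-char directory> + <file name>."""
    return path.parent.name + path.name


def actual_md5(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash the file in chunks so a multi-gigabyte object need not fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def looks_corrupted(path: Path) -> bool:
    """Assumed criterion: the content hash differs from the hash in the path."""
    return actual_md5(path) != expected_md5_from_cache_path(path)
```

Running looks_corrupted() on the copy of 15/e440afe8bb77ed66da4ea829008023 in the cache on the server, and on a manually downloaded copy placed under a directory named 15, would show whether the bytes really hash to the name they are stored under.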