I have a problem pulling files defined in a pipeline.
Running the fetch command:
$ dvc fetch dvc.yaml:correct_h5@5 -v
2024-08-14 15:32:18,325 DEBUG: v3.53.1 (pip), CPython 3.10.12 on Linux-5.15.0-71-generic-x86_64-with-glibc2.35
2024-08-14 15:32:18,325 DEBUG: command: /home/ermolaev/projects/radml/venv/bin/dvc fetch dvc.yaml:correct_h5@5 -v
2024-08-14 15:32:18,898 DEBUG: Lockfile '../02_seg/dvc.lock' needs to be updated.
Collecting |16.0 [00:00, 34.2entry/s]
2024-08-14 15:32:23,847 DEBUG: Preparing to transfer data from 'gdrive://1GkI-Tgwdi1r2oat1E0wxFG3sXc6oL2UL/files/md5' to '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5'
2024-08-14 15:32:23,847 DEBUG: Preparing to collect status from '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5'
2024-08-14 15:32:23,848 DEBUG: Collecting status from '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5'
2024-08-14 15:32:23,850 DEBUG: Preparing to collect status from '1GkI-Tgwdi1r2oat1E0wxFG3sXc6oL2UL/files/md5'
2024-08-14 15:32:23,850 DEBUG: Collecting status from '1GkI-Tgwdi1r2oat1E0wxFG3sXc6oL2UL/files/md5'
2024-08-14 15:32:23,856 DEBUG: Querying 27 oids via object_exists
2024-08-14 15:32:31,193 DEBUG: Querying 0 oids via object_exists
2024-08-14 15:32:34,263 DEBUG: Querying 1 oids via object_exists
Computing md5 for a large file '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5/15/e440afe8bb77ed66da4ea829008023'. This is only done once.
2024-08-14 15:36:51,238 DEBUG: corrupted cache file '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5/15/e440afe8bb77ed66da4ea829008023'.
2024-08-14 15:36:51,239 DEBUG: Removing '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5/15/e440afe8bb77ed66da4ea829008023'
2024-08-14 15:36:52,442 DEBUG: Preparing to transfer data from 'gdrive://1GkI-Tgwdi1r2oat1E0wxFG3sXc6oL2UL' to '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache'
2024-08-14 15:36:52,442 DEBUG: Preparing to collect status from '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache'
2024-08-14 15:36:52,442 DEBUG: Collecting status from '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache'
Fetching
1 file fetched
2024-08-14 15:36:52,453 DEBUG: Analytics is enabled.
2024-08-14 15:36:52,511 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmpa4fq7y13', '-v']
2024-08-14 15:36:52,520 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmpa4fq7y13', '-v'] with pid 26890
Looking at the result:
$ ls ../../data/full_datasets/fractures_0424_seg/h5-corrected/test/ -al
And we can see that the link is broken.
But monitoring the cache folder during the fetch shows that the file was downloaded and then removed:
ermolaev@e4b19654972b:~/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5/15$ du -hs ./*
18M ./4f4b9d51f80aa62418fea1b3e224ca
1.8M ./69abe38069fc9f99f8f995068683fd
4.0K ./a0e72e8fdd621498eaf87c967403ac.dir
4.0K ./d571117c7402aba36330ffed4e76fe.dir
4.0K ./f2a2517da2527a0a9d737e9f63ece6.dir
ermolaev@e4b19654972b:~/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5/15$ du -hs ./*
18M ./4f4b9d51f80aa62418fea1b3e224ca
1.8M ./69abe38069fc9f99f8f995068683fd
4.0K ./a0e72e8fdd621498eaf87c967403ac.dir
4.0K ./d571117c7402aba36330ffed4e76fe.dir
12G ./e440afe8bb77ed66da4ea829008023
4.0K ./f2a2517da2527a0a9d737e9f63ece6.dir
ermolaev@e4b19654972b:~/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5/15$ du -hs ./*
18M ./4f4b9d51f80aa62418fea1b3e224ca
1.8M ./69abe38069fc9f99f8f995068683fd
4.0K ./a0e72e8fdd621498eaf87c967403ac.dir
4.0K ./d571117c7402aba36330ffed4e76fe.dir
12G ./e440afe8bb77ed66da4ea829008023
4.0K ./f2a2517da2527a0a9d737e9f63ece6.dir
ermolaev@e4b19654972b:~/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5/15$ du -hs ./*
18M ./4f4b9d51f80aa62418fea1b3e224ca
1.8M ./69abe38069fc9f99f8f995068683fd
4.0K ./a0e72e8fdd621498eaf87c967403ac.dir
4.0K ./d571117c7402aba36330ffed4e76fe.dir
4.0K ./f2a2517da2527a0a9d737e9f63ece6.dir
ermolaev@e4b19654972b:~/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5/15$ du -hs ./*
18M ./4f4b9d51f80aa62418fea1b3e224ca
1.8M ./69abe38069fc9f99f8f995068683fd
4.0K ./a0e72e8fdd621498eaf87c967403ac.dir
4.0K ./d571117c7402aba36330ffed4e76fe.dir
4.0K ./f2a2517da2527a0a9d737e9f63ece6.dir
This looks very strange. Why did DVC decide to remove an already downloaded file from the cache? Why were some files deleted while others stayed?
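The debug log reports "corrupted cache file" right after "Computing md5 for a large file", so my assumption is that the checksum of the downloaded content did not match the hash encoded in its cache path (directory `15` plus file name `e440afe8bb77ed66da4ea829008023`). A minimal sketch for checking that by hand, assuming the file can be copied out of the cache before it is removed. The helper names are hypothetical and this is not DVC's own code; it only covers plain files, not `.dir` entries.

```python
import hashlib
from pathlib import Path


def expected_md5_from_cache_path(path):
    """Rebuild the hash from a cache path such as .../files/md5/15/e440af...

    Assumption: the two-character directory name plus the file name
    form the md5 hex digest the file is expected to have.
    """
    p = Path(path)
    return p.parent.name + p.name


def actual_md5(path, chunk_size=1024 * 1024):
    """Compute the md5 of a file in chunks, so a 12G file fits in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_file_matches(path):
    """Return True if the file content hashes to the value in its path."""
    return actual_md5(path) == expected_md5_from_cache_path(path)
```

If `cache_file_matches` returns False for a copy that keeps the same `15/e440afe8bb77ed66da4ea829008023` layout, the downloaded content differs from what the lock file expects, which would be consistent with the removal seen in the log.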