Summary: I’m running dvc pull on a .dvc file that references a directory cache file. The cache file does not exist in the local cache, but dvc says everything is ok.
I don’t have an MWE for this. I think something is screwed up in my DVC cache, but I’m not sure, and I’d like to see if anyone recognizes the behavior I’m experiencing.
First here is my
DVC version: 3.2.5.dev1+ge29da6eae.d20230701
--------------------------------------------
Platform: Python 3.11.2 on Linux-5.19.0-45-generic-x86_64-with-glibc2.35
Subprojects:
    dvc_data = 2.3.2.dev1+gde5c685
    dvc_objects = 0.23.1.dev1+g76c3665
    dvc_render = 0.3.1
    dvc_task = 0.3.0
    scmrepo = 1.0.4
Supports:
    azure (adlfs = 2023.4.0, knack = 0.10.1, azure-identity = 1.12.0),
    gdrive (pydrive2 = 1.15.4),
    gs (gcsfs = 2023.6.0),
    hdfs (fsspec = 2023.6.0, pyarrow = 11.0.0),
    http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
    https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
    oss (ossfs = 2021.8.0),
    s3 (s3fs = 2023.6.0, boto3 = 1.26.76),
    ssh (sshfs = 2023.4.1),
    webdav (webdav4 = 0.9.8),
    webdavs (webdav4 = 0.9.8),
    webhdfs (fsspec = 2023.6.0)
Config:
    Global: /home/joncrall/.config/dvc
    System: /etc/xdg/xdg-ubuntu/dvc
Cache types: reflink, hardlink, symlink
Cache directory: btrfs on /dev/nvme1n1
Caches: local
Remotes: s3, ssh, local, ssh, ssh, ssh, local, local, ssh, ssh, ssh, ssh, ssh, ssh, ssh, ssh, ssh, local
Workspace directory: btrfs on /dev/nvme1n1
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/e5b7dab284e2ab932c373f6d94804a2b
Now the setup:
I have two local DVC repos. One is on an HDD (zfs RAID-10) and the other is on an SSD (btrfs single disk). The different part of the dvc doctor for the HDD repo is:
> Cache types: hardlink, symlink
> Cache directory: zfs on data
The SSD repo has a local remote called hdd, set up to point at the local path of the HDD repo's cache.
A while back, in the HDD repo, I ran dvc add on some folders, and everything looked fine. At some point I called dvc pull -r hdd -R . to get a specific subdirectory from my HDD onto my SSD, but afterwards I noticed that some files seemed to be missing.
I found what was missing: there should have been a folder called S2 associated with S2.dvc, but it wasn’t there. On the SSD I ran dvc pull -r hdd S2.dvc, and it said:
Everything is up to date.
but afterwards the folder was still missing. I reran with -vv and got this additional output:
... bunch of collect stages trace
2023-07-03 11:21:43,986 DEBUG: Preparing to transfer data from '/data/joncrall/dvc-repos/smart_data_dvc-hdd/.dvc/cache' to '/media/joncrall/flash1/smart_data_dvc/.dvc/cache'
2023-07-03 11:21:43,986 DEBUG: Preparing to collect status from '/media/joncrall/flash1/smart_data_dvc/.dvc/cache'
2023-07-03 11:21:43,986 DEBUG: Collecting status from '/media/joncrall/flash1/smart_data_dvc/.dvc/cache'
2023-07-03 11:21:43,990 DEBUG: Preparing to transfer data from '/data/joncrall/dvc-repos/smart_data_dvc-hdd/.dvc/cache/files/md5' to '/media/joncrall/flash1/smart_data_dvc/.dvc/cache/files/md5'
2023-07-03 11:21:43,990 DEBUG: Preparing to collect status from '/media/joncrall/flash1/smart_data_dvc/.dvc/cache/files/md5'
2023-07-03 11:21:43,990 DEBUG: Collecting status from '/media/joncrall/flash1/smart_data_dvc/.dvc/cache/files/md5'
Everything is up to date.
2023-07-03 11:21:44,001 DEBUG: Analytics is disabled.
Nothing here seems to be helpful. So I started manually digging into it. I saw the content of S2.dvc was:
outs:
- md5: 82ece028a0e44746fea5aee2e2603f6a.dir
  size: 3672919
  nfiles: 630
  hash: md5
  path: S2
So I went to check whether that file exists in my cache. On the SSD I ran:
( test -f $(dvc cache dir)/files/md5/82/ece028a0e44746fea5aee2e2603f6a.dir && echo "exist" ) || echo "missing"
and I got “missing”. I reran the same command on the HDD and got “exist”. So the file that S2.dvc references exists on the remote and does not exist locally, but dvc pull is not detecting that?
For good measure I also reran using dvc 2.0 cache paths:
( test -f $(dvc cache dir)/82/ece028a0e44746fea5aee2e2603f6a.dir && echo "exist" ) || echo "missing"
which returned missing on both repos (which is expected).
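The two checks above differ only in where the object sits under the cache directory, so they can be folded into one helper. This is a minimal sketch: find_cache_object is a hypothetical name (not a DVC command), and the cache directory is passed in as an argument so the same function can be pointed at either repo's cache.

```bash
#!/usr/bin/env bash
# find_cache_object is a hypothetical helper, not part of DVC.
# It looks for an object hash under a cache directory, trying the
# DVC 3.x layout (files/md5/<first 2 chars>/<rest>) first and then
# the DVC 2.x layout (<first 2 chars>/<rest>).
find_cache_object() {
    local cache_dir="$1" hash="$2"
    local prefix="${hash:0:2}" rest="${hash:2}"
    local candidate
    for candidate in \
        "$cache_dir/files/md5/$prefix/$rest" \
        "$cache_dir/$prefix/$rest"; do
        if [ -f "$candidate" ]; then
            # Print the path that was found so the layout is visible.
            echo "$candidate"
            return 0
        fi
    done
    echo "missing"
    return 1
}

# Usage (assumes dvc is on PATH and the shell is inside the repo):
#   find_cache_object "$(dvc cache dir)" 82ece028a0e44746fea5aee2e2603f6a.dir
```

Printing the matching path rather than just "exist" shows which of the two layouts the object was found in.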
I also ran dvc checkout S2.dvc and it didn’t do anything. It just collected stages, said Analytics is disabled, and exited with a return code of 0.
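One way to narrow this down further would be to check whether the individual file objects listed in the .dir object made it into the SSD cache. The sketch below rests on two assumptions that are not confirmed in this post: that the .dir object is a JSON list of entries with "md5" and "relpath" keys, and that the file objects use the 3.x files/md5 layout. list_missing_dir_entries is a hypothetical helper, not a DVC command.

```bash
#!/usr/bin/env bash
# list_missing_dir_entries is a hypothetical helper, not part of DVC.
# Arguments: path to a .dir object, and the cache directory to check.
# Assumption: the .dir object is JSON containing "md5": "<32 hex chars>" entries.
# Prints the hash of every listed file that is absent from the cache.
list_missing_dir_entries() {
    local dir_file="$1" cache_dir="$2" hash
    # Pull out each "md5" value, then keep only the 32-character hash.
    grep -o '"md5": *"[0-9a-f]\{32\}"' "$dir_file" |
        grep -o '[0-9a-f]\{32\}' |
        while read -r hash; do
            if [ ! -f "$cache_dir/files/md5/${hash:0:2}/${hash:2}" ]; then
                echo "$hash"
            fi
        done
}

# Usage, reading the .dir object from the HDD cache (where it exists)
# and checking the SSD cache, with the paths from the debug log:
#   list_missing_dir_entries \
#       /data/joncrall/dvc-repos/smart_data_dvc-hdd/.dvc/cache/files/md5/82/ece028a0e44746fea5aee2e2603f6a.dir \
#       /media/joncrall/flash1/smart_data_dvc/.dvc/cache
```

Empty output would mean every listed file object is already in the SSD cache and only the .dir object itself is absent; a long list would mean the directory contents were never transferred.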
Because I was using a development version of DVC for this, I also ran a final test where I uninstalled my editable dvc-data, dvc-objects, and dvc repo, then installed the latest dvc from PyPI, which gave me these versions:
Subprojects:
    dvc_data = 2.3.3
    dvc_objects = 0.23.0
    dvc_render = 0.3.1
    dvc_task = 0.3.0
    scmrepo = 1.0.4
dvc checkout did the same thing, and dvc pull did the same thing. I also tried on dvc 3.2.0 and dvc 3.1.0 with no change.
In case it matters, the only non-remote parts of my SSD dvc config / config.local are:
[cache]
    type = "symlink,reflink,hardlink,copy"
    shared = group
    protected = true
[core]
    analytics = false
    autostage = true
I’m pretty stumped by this. Is there any reason why pulling a DVC file, where the first cache path it references doesn’t exist, would report that everything is up to date?