Help with upgrading imported (via DVC 2.x) data with DVC 3.0

Hi,

I have a data repository set up similar to the Data Registry.
There are two versions of the dataset, v1 and v2 (set up via git tags).
All the data is on the remote and also in a local/shared(!) cache (done via dvc add/push etc.; dvc status AND dvc status -c say that everything is in sync).
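
For reference, the registry side was set up roughly like this (paths and names are placeholders, not my real ones):

dvc cache dir /dvc-cache          # shared cache on the fast drive
dvc config cache.shared group
dvc config cache.type symlink
dvc add data                      # originally done with DVC 2.x
git add data.dvc .gitignore && git commit -m "dataset v1" && git tag v1
dvc push
# ...later: modify files under data/, dvc add data again, commit, tag v2, dvc push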

My separate project imported v1, and running
dvc checkout
gets the data quickly from the local/shared cache. => All good!!!

NOTE: the data.dvc file points to data stored in the DVC 2.0 cache location (/xx/ and not /files/md5/xx), and the original dvc import command was run with 2.0.

I then update data.dvc to v2 with
dvc update --rev v2 data.dvc
but the data (although already in the cache) seems to get downloaded before a link can be established, and I run out of hard disk space.

There should be no need to download anything, because all the files are already in the local/shared cache, so I tried:

dvc update --rev v2 data.dvc --no-download
dvc checkout -v

I am getting errors like
No such file or directory: ‘/dvc-cache/fs/local//someDirectoryName/data.bin’
I can’t determine where that path comes from. Also, data.bin is the real file name and not an md5 hash.

I am not sure why DVC searches “fs/local”, because all the v2 files (added with DVC 3.0) live in /dvc-cache/files/md5/xx/.

I have tried a dvc pull data.dvc after the update, and that seems to start creating an fs/local directory, but that now creates duplicates of files in the cache (so I have files in both cache/files/md5 and cache/fs/local//realFileName.bin), and I run out of space again.

Now I am very lost on how to solve this.

I also noticed that after updating the .dvc file, the

outs:
- md5:

section is empty. Before, it pointed at a .dir file.
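
Roughly, before the update the outs section of the stub looked something like this (placeholder hash):

outs:
- md5: d8acabbfd4ee51c95da5d7628c7ef74b.dir
  path: data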

Any help would be really appreciated.

I’ve reproduced this (at least partially) with:

dvc import https://github.com/iterative/dataset-registry get-started/data.xml
rm -f data.xml data.xml.dvc
dvc import https://github.com/iterative/dataset-registry get-started/data.xml --no-download
dvc checkout -v

Also getting:

failed to create '/Users/ivan/Projects/test-dvc/data.xml' from '/Users/ivan/Projects/test-dvc/.dvc/cache/files/md5/fs/local/3a7b948ce2a36294f8c50df53f1f7c92' - [Errno 2] No such file or directory

I’m not sure if those are all the bugs in this scenario, but this behavior looks unexpected to me.

Oh cool. Sounds like some part can be repro’ed. Should I file a bug on GitHub regarding those steps?

Hey @run45. Thanks for the feedback! I’m taking a look…

That error is expected: --no-download doesn’t compute the md5 for the file (rightfully so), so it can’t check out from the md5 cache; it then falls back to the non-odb cache and finds that it doesn’t have that file either. There is a bug with the non-odb cache path, but it is likely not related.

Taking a look at the original issue…

Ok, so this is the main issue. The reason is that we don’t deduplicate files between the DVC 2.0 and DVC 3.0 caches. We have discussed before that we might want to use hardlinks or some kind of hash table to dedup those, but we didn’t have the capacity to implement that. Seems like you are tight on space, so I think the only workaround for you right now is to delete the cache for data.dvc v1 and then dvc update, unless you need to constantly switch back and forth between those versions. Also, how big is that data? Are you just short on space, or is the data truly giant?
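
For illustration, the same file can end up stored in both cache layouts (placeholder hash):

/dvc-cache/d8/acabbfd4ee51c95da5d7628c7ef74b             <- DVC 2.x location
/dvc-cache/files/md5/d8/acabbfd4ee51c95da5d7628c7ef74b   <- DVC 3.x location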

Just to make sure I understand the situation correctly: both the dataset registry and the project you are in use the same cache, right? And the dataset registry is using DVC 2.0 and already has data v2 in the cache, but at the DVC 2.0 location, while you are running dvc update with DVC 3.0. Is that correct?

Also, how big is that data? Are you just short on space, or is the data truly giant?

About 4TB. I don’t mind a one-off long (1 or 2 day) process if I could end up in a situation where I can easily switch between v1 and v2. Also, the cache is “shared” and on a different drive than the ML project that uses it (and that drive has even less space).

delete the cache for data.dvc v1 and then dvc update

So would that remove the DVC 2.0-created cache entries and recreate the v1 entries for DVC 3.0?
What is the best way to “delete the cache for data.dvc v1”?

Not sure if it is an additional complication, but I noticed duplicates of lots of shared files between v1 and v2 on both the remote and the cache (v1 and v2 are not completely different; they share files, and because of the DVC 2.0/3.0 change I seem to have lots of duplicates with the same md5 but in different locations in the cache AND the remote).
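
For example, a rough bash check for the overlap (assuming the shared cache root is /dvc-cache) would be something like:

cd /dvc-cache
comm -12 \
  <(find files/md5 -type f | sed 's|^files/md5/||; s|/||' | sort) \
  <(find . -maxdepth 2 -type f -path './??/*' | sed 's|^\./||; s|/||' | sort)

That lists hashes that exist under both the 2.x layout (/dvc-cache/xx/…) and the 3.x layout (/dvc-cache/files/md5/xx/…).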

4TB is a collection of (roughly how many, btw?) files in a data directory, right?

Do you often modify that data or is it rather immutable?

Disregard my suggestion about deleting the cache; clearly with 4TB of data and the need to switch between versions that’s a no-go.

That duplication is again because we don’t dedup the 2.x and 3.x cache files :slightly_frowning_face:

Btw, could you share your dvc doctor output, please? Just want to see some additional info.

Do you use a remote, btw? Or only shared cache?

First: thanks so much for helping me with this.

4TB is a collection of (roughly how many, btw?) files in a data directory, right?

Yes… approx. 30000 files. v2 has only some files different from v1, so de-duplicating the shared entries would probably save quite a bit of space.

Regarding dvc doctor:

DVC version: 3.10.1 (pip)

Platform: Python 3.10.12 on Linux-5.15.0-82-generic-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 2.8.1
        dvc_objects = 0.24.1
        dvc_render = 0.5.3
        dvc_task = 0.3.0
        scmrepo = 1.1.0
Supports:
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        ssh (sshfs = 2023.7.0)
Config:
        Global: /home//.config/dvc
        System: /etc/xdg/dvc
Cache types: symlink
Cache directory: ext4 on /dev/md0
Caches: local
Remotes: ssh, ssh, ssh
Workspace directory: ext4 on /dev/mapper/ubuntu--vg-ubuntu--lv
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/
=======================

Do you use a remote, btw? Or only shared cache?

Yes. I am using a remote which also has all that duplicated DVC 2.0/3.0 data on it. So the setup is (rough config sketch below):

  1. slow, large HDD (the remote) that gets backed up
  2. fast, smaller shared cache on a different (fast) HDD… big enough to hold v1 and v2 (but not a v3 once I create that)
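
Roughly, the relevant config (.dvc/config) looks like this (paths and host are placeholders):

[core]
    remote = backup
['remote "backup"']
    url = ssh://backup-host/dvc-storage     # the slow, large hdd that gets backed up
[cache]
    dir = /dvc-cache                        # the fast, smaller shared cache
    type = symlink
    shared = group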

Because the large HDD gets backed up, I could roll back the remote to when only v1 was there, clear the cache => migrate somehow to DVC 3.0 (on the remote) and then add the v2 data again to both cache and remote… but that is probably my last resort (also, I’m still not sure how I could either migrate the remote to 3.0 OR remove the duplicates).

Someone (maybe it’s you?) created how to remove dvc v2.0/3.0 duplicates in remote (and cache) · Issue #9924 · iterative/dvc · GitHub, so I guess we can move there.

Seems like dedup is the only proper way to go here, so that DVC can link from both the 2.x and 3.x caches.

Yes, that was me. I asked the duplicate-remote question (not the upgrade problem) on Discord and got sent a link to GitHub to open the issue. But by that time, “deduplicating the remote” and “not being able to update to v2” seemed to be separate issues to me.

This leaves me with these questions:

  1. So if I live for the moment with duplicated data in the remote & cache: how do I upgrade my data.dvc file to v2? I know the suggestion was

delete the cache for data.dvc v1 and then dvc update

but I am not sure how to do the first step or what it does: would I then end up with both v1 and v2 in the cache in DVC 3.0 format?

but the data (although already in the cache) seems to get downloaded before a link can be established, and I run out of hard disk space.

This is the main problem when doing the update atm. All of the files seem to get downloaded at once to the project HDD first (which is small) and then “maybe” moved to the cache and symlinked. I say “maybe” because currently the download fails before the symlink can happen.

I’m a bit desperate for a solution. I don’t mind if any fix takes a long time because of the 4TB, as long as it is a one-off operation.

So to summarize:

v1 and v2 fit in the cache that lives on hddA

For project B, living on a small hddB:

  1. checking out v1 for data.dvc is fine → a symlink is just created to the cache
  2. upgrading to v2 for data.dvc fails because data gets downloaded to hddB first; downloading should not be required because the v2 files already live in the cache (and that seems to have something to do with the DVC 3.0/2.0 cache differences… not sure)

This might be a stupid question, but if I just move all the DVC 2.0 files from the cache and remote into the /files/md5 directory, should that all just work? Given that a file keeps the same md5 name, it should remove duplicates, right? Would anything that refers to the old data be able to find the files in the new /files/md5/ location?

That won’t work. The reason for the completely separate storages is that DVC 2.x MD5 is not the same as DVC 3.x MD5 (see Upgrading to DVC 3.0 for details). DVC will only look for old 2.x data in the 2.x cache/remote location (so it will not look for old data in /files/md5).
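
To illustrate roughly where the 2.x/3.x hash difference comes from (a sketch, not the exact implementation): 2.x normalized line endings for files it detected as text before hashing, while 3.x hashes the raw bytes.

printf 'hello\r\n' > f.txt
md5sum f.txt                    # raw bytes, as 3.x hashes them
tr -d '\r' < f.txt | md5sum     # roughly what 2.x hashed for files it treated as text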

The reason that dvc cache migrate only supports local cache is that we can create symlinks/hardlinks/reflinks to do the file deduplication between 3.x and 2.x locations on a local filesystem (so we can have the actual file stored in the 3.x location, and a link pointing to the file in the old 2.x location). But unfortunately remote storages generally do not support linking.
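
Conceptually, the kind of local-filesystem dedup I mean is just (placeholder hash):

ln /dvc-cache/files/md5/d8/acabbfd4ee51c95da5d7628c7ef74b /dvc-cache/d8/acabbfd4ee51c95da5d7628c7ef74b

i.e. the bytes are stored once under the 3.x location and the old 2.x path is a hardlink to them, which most remote storages have no equivalent for.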

Thanks for the update. In my case I have only binary files, and I was hoping that in that case the md5s would be the same (I assumed the change only affected CR/LF/text issues).

Just to clarify here, DVC already does support 2.x/3.x deduplication for local cache (via dvc cache migrate). Deduplication is only currently unsupported for remotes.
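
For reference, you can preview it first from a repo that uses the shared cache:

dvc cache migrate --dry -v    # shows what would be migrated, without changing anything
dvc cache migrate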

Thanks for the clarification. I have not tried the cache migration yet.

The main issue I am still trying to resolve is how to dvc update my data.dvc file in a way that I can quickly switch between v1 and v2 of the dataset. Removing duplicates on the remote is only the second issue for me.

But from some of the comments above I got the impression that all my problems stem from the duplicates (maybe I misunderstood). So, given that all my data is binary and not text, I assumed that in that case I could just move the data. But I don’t dare, because I am not sure about side effects that I haven’t considered.

Also, given that I could try “dvc cache migrate”: would that solve my actual issue of not being able to upgrade and switch between the two versions of my dataset (although both versions are in the cache)?

Also, given that I could try “dvc cache migrate”: would that solve my actual issue of not being able to upgrade and switch between the two versions of my dataset (although both versions are in the cache)?

Running dvc cache migrate should make it so that dvc update --rev v2 does not re-download anything that is already in your local cache.
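
Roughly (using your data.dvc and the v2 tag):

dvc cache migrate
dvc update --rev v2 data.dvc

After the migration, files that are already in the shared cache should just get linked into the workspace instead of being downloaded again.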

Thanks for that comment. I will try the migrate. For testing I tried it with the --dry -v option. It showed me (correctly!) lots of files that did not need migration (like the 3.0 files and the files in fs/local (not sure how those files ended up in my cache)). It also gave me the number of files that it will migrate, just not their names (hope that is ok).

Ok. I finally bit the bullet and re-created the repository from scratch, using only DVC 3.x.
So the remote and the cache are in the same state, and no old DVC 2.x hashes are anywhere.

So when I do a “dvc import” I still get the behaviour that the files get downloaded from the remote although they are in the cache.

What I see is that a file gets downloaded from the remote, and then the link to the hash is established. But the files are already in the cache, so the download is not required. I run out of disk space like this. What am I doing wrong? (I had this all working with DVC 2.x.)