ERROR: failed to pull data from the cloud

I have a DVC repo and I pushed updated data to a remote on S3, but now I've run into an error. The data was uploaded to:
s3://<my-bucket-name>/dataset-registry/cache/files/md5/81/<the-number>
but both dvc pull <data> and dvc.api.get_url(...) fail to pull the data from the cloud. Then I found that when I use dvc.api.get_url, the URL it returns is:
s3://<my-bucket-name>/dataset-registry/cache/81/<the-number>
I don't know how this happened. Here are the commands I used to update the file:

pipenv run dvc add <location of your dataset> --out datasets/<dataset>/data --to-remote
git add .
git commit -m "dvc update <dataset>"
git push

Any suggestions will be super helpful!

It looks like there has been some mix-up with the different DVC versions you are using. Either you changed versions at some point, or you have different environments that use different versions.

s3://<my-bucket-name>/dataset-registry/cache/files/md5/81/<the-number> is the path expected in dvc>=3.0.

s3://<my-bucket-name>/dataset-registry/cache/81/<the-number> is the path expected in older dvc versions.

See Upgrading to DVC 3.0 for more details.
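For reference, this is roughly how the same md5 value maps to the two layouts (an illustrative sketch only, with a made-up hash value, not actual DVC internals):

# Illustrative sketch; the hash value below is made up.
md5 = "81aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"

# dvc<3.0: the first two hex characters become a directory, the rest the file name
old_path = f"dataset-registry/cache/{md5[:2]}/{md5[2:]}"

# dvc>=3.0: same split, but nested under files/md5/
new_path = f"dataset-registry/cache/files/md5/{md5[:2]}/{md5[2:]}"

print(old_path)  # dataset-registry/cache/81/...
print(new_path)  # dataset-registry/cache/files/md5/81/...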


Thanks! I will check my DVC version to see if that causes the issue.

Thank you for the suggestion. I just double-checked my DVC version with dvc doctor:

DVC version: 3.42.0 (pip)
-------------------------
Platform: Python 3.8.18 on Linux-4.14.330-250.540.amzn2.x86_64-x86_64-with-glibc2.10
Subprojects:
        dvc_data = 3.8.0
        dvc_objects = 3.0.6
        dvc_render = 1.0.1
        dvc_task = 0.3.0
        scmrepo = 2.0.4
Supports:
        http (aiohttp = 3.9.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.3, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2024.2.0, boto3 = 1.34.34)

And when I dug into the code behind get_url, I found this hash_info in my entry:

hash_info=HashInfo(name='md5-dos2unix', value=<my-md5-number>, obj_name=None)

So it looks like I have already upgraded DVC to 3.x. I am still not sure why this is happening.
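If I understand the naming right, md5-dos2unix is what DVC 3 calls the legacy DVC 2 hash (md5 computed after dos2unix newline normalization), and entries that still carry it get addressed under the old cache layout. A tiny sketch of what I mean (just the naming logic, with the hash name as a plain string):

# My understanding (could be wrong): the hash name on an entry tells
# which cache layout it is addressed under.
def layout_for(hash_name: str) -> str:
    if hash_name == "md5-dos2unix":
        return "legacy DVC 2 hash -> cache/<2 chars>/<rest>"
    if hash_name == "md5":
        return "DVC 3 hash -> cache/files/md5/<2 chars>/<rest>"
    return "unknown hash name"

print(layout_for("md5-dos2unix"))  # what my entry has
print(layout_for("md5"))           # what a migrated entry would have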

Can you show the full API call that you run? Also, can you show the corresponding .dvc file or dvc.lock file for that data?

Here is how I make the API call:

import dvc.api

dataset_name = "cat"
version = "cat-v3.0.0"
path = f"datasets/{dataset_name}/data"
repo = "git@github.com:<account-name>/dataset-registry-dvc.git"
resource_url = dvc.api.get_url(
    path=path,
    repo=repo,
    rev=version,
)
print(resource_url)

Then the resource_url comes back as:

s3://<my-bucket-name>/dataset-registry/cache/06/<the-number>

Then I dug into the DVC code to trace the process:

from typing import Any, Dict

from funcy import reraise

from dvc.config import NoRemoteError
from dvc.repo import Repo
from dvc_data.index import StorageKeyError

# path, repo, and version are the same values as in the get_url call above
remote = None  # no remote override, so the repo's default remote is used

repo_kwargs: Dict[str, Any] = {}
if remote:
    repo_kwargs["config"] = {"core": {"remote": remote}}
with Repo.open(
    repo, rev=version, subrepos=True, uninitialized=True, **repo_kwargs
) as _repo:
    index, entry = _repo.get_data_index_entry(path)
    with reraise(
        (StorageKeyError, ValueError),
        NoRemoteError(f"no remote specified in {_repo}"),
    ):
        remote_fs, remote_path = index.storage_map.get_remote(entry)
        print(remote_fs.unstrip_protocol(remote_path))

The entry I get is:

DataIndexEntry(key=('datasets', 'cat', 'data'), meta=Meta(isdir=False, size=519180577, nfiles=None, isexec=False, version_id=None, etag=None, checksum=None, md5=<06the-number>, inode=None, mtime=None, remote=None), hash_info=HashInfo(name='md5-dos2unix', value=<06the-number>, obj_name=None), loaded=None)

Also here is the remote odb:

index.storage_map.get_remote_odb(entry)
HashFileDB(fs=<dvc_s3.S3FileSystem object at <a-number>>, path='<my-bucket-name>/dataset-registry/cache', read_only=False)

And my .dvc file for this is:

outs:
- md5: 06<the-number>
  size: 519180577
  path: data

One more piece of information: this datasets/cat/data was originally initialized with DVC v2, and this time I updated it with DVC v3.
Thanks!

The .dvc file you showed indicates it still hasn't been updated to DVC 3 syntax. It should include a field like hash: md5 if it is using DVC 3 syntax. It's possible that the change didn't get committed to git when you pushed the updated file. You can run dvc cache migrate --dvc-files. You should see that it adds this field to your .dvc file, and you can then commit the changes to git.
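If it helps, here is a minimal sketch of how you could check a .dvc file for that field programmatically (just an illustration; it assumes PyYAML is installed and uses the file path from this thread):

import yaml

# A .dvc file is plain YAML; check each output for the DVC 3 hash field.
with open("datasets/cat/data.dvc") as f:
    dvcfile = yaml.safe_load(f)

for out in dvcfile.get("outs", []):
    if out.get("hash") == "md5":
        print(f"{out['path']}: DVC 3 syntax (hash: md5 present)")
    else:
        print(f"{out['path']}: legacy DVC 2 syntax (no hash field)")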

I tried dvc cache migrate --dry, but it shows me this:

[screenshot of the dvc cache migrate --dry output, not captured here]

I am not on the main branch, but I did run git add ., git commit -m ..., and git push.

Here is my setup in .dvc/config:

[core]
    remote = s3cache
['remote "s3cache"']
    url = s3://<my-bucket-name>/dataset-registry/cache

Please also include --dvc-files when you run dvc cache migrate.

Just did it:

dvc cache migrate --dvc-files

It updated my workspace’s datasets/cat/data.dvc to:

outs:
- md5: <the-number>
  size: 519180577
  path: data
  md5-dos2unix: <the-number>
  hash: md5

Then I pushed the changes from local to remote:

git add datasets/cat/data.dvc
git commit -m "migrate"
git push origin <my-branch>

but it still doesn't work. dvc.api.get_url() still gives me the old-style URL (cache/<the-number>), not the new one (cache/files/md5/<the-number>).
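For reference, here is how I re-check after the push (just a sketch; one thing I am unsure about is the rev, since the cat-v3.0.0 tag would still point at the pre-migration commit, so I also check against the branch I pushed to):

import dvc.api

repo = "git@github.com:<account-name>/dataset-registry-dvc.git"
path = "datasets/cat/data"

# Original call: a git tag stays pinned to the commit it was created on,
# so this may still read the pre-migration .dvc file.
print(dvc.api.get_url(path=path, repo=repo, rev="cat-v3.0.0"))

# Same call against the branch the migrate commit was pushed to.
print(dvc.api.get_url(path=path, repo=repo, rev="<my-branch>"))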