Here is how I make the API call:
import dvc.api

dataset_name = "cat"
version = "cat-v3.0.0"
path = f"datasets/{dataset_name}/data"
repo = "git@github.com:<account-name>/dataset-registry-dvc.git"

# Resolve the remote storage URL for the tracked path at the given Git revision.
resource_url = dvc.api.get_url(
    path=path,
    repo=repo,
    rev=version,
)
print(resource_url)
The resulting resource_url is:
s3://<my-bucket-name>/dataset-registry/cache/06/<the number>
Then I dug into how DVC resolves the URL internally:
from typing import Any, Dict, Optional

from funcy import reraise

from dvc.config import NoRemoteError
from dvc.repo import Repo
from dvc_data.index import StorageKeyError

# No remote override; path, repo and version are the values from the call above.
remote: Optional[str] = None

repo_kwargs: Dict[str, Any] = {}
if remote:
    repo_kwargs["config"] = {"core": {"remote": remote}}

with Repo.open(
    repo, rev=version, subrepos=True, uninitialized=True, **repo_kwargs
) as _repo:
    # Look up the data index entry for the tracked path.
    index, entry = _repo.get_data_index_entry(path)
    with reraise(
        (StorageKeyError, ValueError),
        NoRemoteError(f"no remote specified in {_repo}"),
    ):
        remote_fs, remote_path = index.storage_map.get_remote(entry)
    print(remote_fs.unstrip_protocol(remote_path))
The entry I get is:
DataIndexEntry(key=('datasets', 'cat', 'data'), meta=Meta(isdir=False, size=519180577, nfiles=None, isexec=False, version_id=None, etag=None, checksum=None, md5=<06the-number>, inode=None, mtime=None, remote=None), hash_info=HashInfo(name='md5-dos2unix', value=<06the-number>, obj_name=None), loaded=None)
Here is the remote ODB as well:
index.storage_map.get_remote_odb(entry)
HashFileDB(fs=<dvc_s3.S3FileSystem object at <a-number>>, path='<my-bucket-name>/dataset-registry/cache', read_only=False)
And my .dvc file for this is:
outs:
- md5: 06<the-number>
  size: 519180577
  path: data
One more piece of information: this dataset/tree/data was initialized with DVC v2, and this time I updated it with DVC v3.
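To make the v2/v3 difference concrete, here is a minimal sketch of how I understand the two object layouts. It is not DVC's actual code. It assumes that an `outs` entry without a `hash: md5` field is treated as a legacy `md5-dos2unix` object stored directly under the cache prefix, and that a DVC v3 `md5` object is stored under `files/md5/`. The helpers `hash_name_for` and `object_url` and the sample hash are hypothetical names and values for illustration only.

```python
from typing import Dict


def hash_name_for(out: Dict[str, object]) -> str:
    """Guess the hash name for an `outs` entry.

    Assumption: entries without `hash: md5` are legacy (DVC v2) objects.
    """
    return "md5" if out.get("hash") == "md5" else "md5-dos2unix"


def object_url(cache_url: str, out: Dict[str, object]) -> str:
    """Build the expected object URL under a cache prefix.

    Assumption: v3 objects live under `files/md5/`, legacy objects
    live directly under the cache prefix.
    """
    md5 = str(out["md5"])
    parts = [cache_url.rstrip("/")]
    if hash_name_for(out) == "md5":
        parts += ["files", "md5"]
    # Objects are sharded by the first two hex characters of the hash.
    parts += [md5[:2], md5[2:]]
    return "/".join(parts)


if __name__ == "__main__":
    cache = "s3://<my-bucket-name>/dataset-registry/cache"
    fake_md5 = "06" + "0" * 30  # hypothetical hash, not my real one

    # Same shape as my .dvc file: no `hash` field.
    print(object_url(cache, {"md5": fake_md5, "size": 519180577, "path": "data"}))

    # What I would expect a v3-style entry to look like.
    print(object_url(cache, {"md5": fake_md5, "hash": "md5", "path": "data"}))
```

If that assumption holds, the URL I get back matches the legacy layout, which is consistent with the `md5-dos2unix` hash name in the entry above and with my .dvc file having no `hash` field.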
Thanks!