dvc get error: Unable to find DVC file

In a cloned local Git repo, there are already some .dvc files, and I want to add new data.
In the config file under the .dvc folder, I have

[core]
    remote = my-remote
[cache]
    type = "reflink,hardlink"
['remote "my-remote"']
    url = s3://...

And I did

dvc add --external /path/to/mydata/
git add mydata.dvc
git commit -m "added my data"
dvc remote default my-remote
dvc push

Everything looked good so far, since it showed 10000 files pushed.

But when I tried to download the data with dvc get <git url> <folder_name/dvc_filename>, it returned

unexpected error - : Unable to find DVC file with output 
'../../../../../private/var/folders/5v/6xws5skx46z5rg2y33_nwd1mqcvql4/T/tmp26wt3huydvc-clone/folder_name/dvc_filename'

Can anyone help with this?

Thank you

DEBUG: Version info for developers:
DVC version: 2.3.0 (pip)

Platform: Python 3.8.5 on macOS-10.16-x86_64-i386-64bit
Supports: http, https, s3, ssh
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s5s1
Caches: local
Remotes: s3
Workspace directory: apfs on /dev/disk1s5s1
Repo: dvc, git

Hi @rancat,

dvc add --external produces what’s known as an “external output”. These are not pushed to nor pulled from remote storage. In general we don’t recommend add --external, as it’s meant for a very specific use case (when absolutely no other options are available). More details in https://dvc.org/doc/user-guide/managing-external-data

If you can explain your setup and what you want to achieve, we can recommend another way. Perhaps via dvc add -o or --to-remote instead. See https://dvc.org/doc/command-reference/add#example-transfer-to-the-cache and the next example. Thanks
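Roughly, it could look like this (paths and the remote name are placeholders; please double-check the exact flags for your DVC version in the linked docs):

# Transfer the external data into the project's cache and link it in the workspace:
dvc add /path/to/mydata -o mydata
# Or transfer it straight to remote storage, skipping the local cache:
dvc add /path/to/mydata --to-remote -r my-remote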

Hi @jorgeorpinel

Thanks for your reply.
Let me explain my case with more details.
I have a model repo which has data files and training scripts. And I want to “save” my data in a data repo via DVC. The data repo is shared with the whole team so that every team member can easily get the data with dvc get.
For this reason, I ran dvc add --external <my model repo data> in the data repo.

Hope this makes more sense.

OK. TBH I’m a bit confused by the use of the term “repo” above.

So this one is not a DVC repo? Is it a plain Git repo that tracks/versions large files?

For now you may want to look at the data registry pattern. You’d add the data there first, and then reuse it in other places, whether in a DVC repo or not (e.g. with dvc import or dvc get, respectively).
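For example (the URL and path are placeholders):

# Inside another DVC project: downloads the data and tracks the dependency on the registry
dvc import <data-registry-git-url> path/to/data
# Anywhere else: plain download, nothing is tracked
dvc get <data-registry-git-url> path/to/data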

Here all “repos” are Git repos.
The data repo is a Git repo containing only data (e.g., .dvc files). I think this is a DVC repo?
The model repo is a Git repo containing the model (training scripts, etc.) along with data. But the data will not be tracked in this model repo because I added it to .gitignore. Instead, I want the DVC repo to track it.
So in the DVC repo I ran dvc add --external <my model repo data>. I don’t know why the pushed data is not reusable…

Yes, that’s what we mean by DVC repo :slightly_smiling_face:

Right. So why not regular dvc add it in the DVC repo first, and then dvc get it into the model repo when needed?
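Roughly, that flow would be (names are placeholders):

# In the data (DVC) repo:
dvc add path/to/data
git add path/to/data.dvc path/to/.gitignore
git commit -m "track data"
dvc push

# In the model repo, whenever the data is needed:
dvc get <data-repo-git-url> path/to/data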

add --external data is not pushed to remote storage. The external cache you set up in order to be able to use add --external is already, in a way, “remote” (external) storage. See DVC fails to push data from external cache to default remote · Issue #4686 · iterative/dvc · GitHub

BTW this is noted in External Data.
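(For reference, an external cache is typically set up with something like the following; the path here is just an example:)

dvc cache dir /mnt/shared/dvc-cache   # point the cache to a location outside the repo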

Thank you.
I took your suggestion and ran dvc push.

It showed Everything is up to date.
I can successfully download the data on my local Mac, but when I tried dvc get on a remote machine, I got:
Some of the cache files do not exist neither locally nor on remote.

Attaching the debug output in case it helps.

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/dvc/main.py", line 55, in main
    ret = cmd.do_run()
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/dvc/command/base.py", line 64, in do_run
    return self.run()
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/dvc/command/get.py", line 31, in run
    return self._get_file_from_repo()
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/dvc/command/get.py", line 37, in _get_file_from_repo
    Repo.get(
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/dvc/repo/get.py", line 55, in get
    repo.repo_fs.download(from_info, to_info, jobs=jobs)
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/dvc/fs/base.py", line 265, in download
    return self._download_dir(
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/dvc/fs/base.py", line 275, in _download_dir
    from_infos = list(self.walk_files(from_info, **kwargs))
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/dvc/fs/repo.py", line 368, in walk_files
    for root, _, files in self.walk(path_info, **kwargs):
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/dvc/fs/repo.py", line 355, in walk
    yield from dvc_fs.walk(
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/dvc/fs/dvc.py", line 198, in walk
    self._add_dir(trie, out, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/dvc/fs/dvc.py", line 143, in _add_dir
    self._fetch_dir(out, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/dvc/fs/dvc.py", line 140, in _fetch_dir
    raise FileNotFoundError
FileNotFoundError

Please share the full steps with exact commands so we can try to determine what went wrong. As it is now, I have to assume many of the steps, and for me it works locally. I don’t even need to dvc push to remote storage, since dvc get copies directly from the cache for repos on the same file system.

Also, keep in mind the data registry should have the corresponding .dvc files committed to Git for dvc get to be able to access them from some other location. Thanks
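(One quick way to check which .dvc files are tracked by Git, for example:)

git ls-files '*.dvc'   # lists the .dvc files committed or staged in the repo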

Before any of these steps, my DVC repo had some old data .dvc files and a .dvc folder containing cache, plots, temp, .gitignore, and config.
In the config, there is

[core]
    remote = myremote
[cache]
    type = "reflink,hardlink"
['remote "myremote"']
    url = s3://…/

On my local mac,

  1. I added new data to the DVC repo.
  2. Ran these commands:
dvc config cache.type reflink,hardlink
dvc add ./path/to/new_data/
git add new_data.dvc
git commit -m "added new data"
dvc push
git push

On my remote machine

dvc get <git repo url> folder_A/folder_B/new_data

Then I got the error: Some of the cache files do not exist neither locally nor on remote.


Thanks for the details.

I guess you moved (mv) the new_data from the external location into the repo first, right? (Previously /path/to/mydata, now path/to/new_data inside the DVC repo.)

This would be path/to/new_data.dvc, right? BTW, please also git add the corresponding .gitignore file.

Again, this would be path/to/new_data, right? (the 2nd argument)

I can see that this error message is not very informative :slightly_frowning_face:. Can you please share the full debug output? Not just the error message or the Python traceback, but the full output of that get command, run with the -v flag.
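For example:

dvc get -v <git repo url> folder_A/folder_B/new_data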

One possibility is that the remote machine can’t connect to the S3 bucket. Does it have the AWS CLI configured with the appropriate credentials? For other ways to configure S3 auth, please see remote modify.

UPDATE: Actually, I guess from the remote machine you would never run remote modify, so probably only the default AWS config is available in that case… I’ll double-check with the team :hourglass:
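(For reference, the default AWS credentials are usually set up in one of these ways; the values are placeholders:)

aws configure                          # interactive: access key, secret key, region
export AWS_ACCESS_KEY_ID=<key-id>      # or via environment variables
export AWS_SECRET_ACCESS_KEY=<secret>
aws sso login                          # or, for SSO-based setups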

Thanks @jorgeorpinel

You reminded me about the AWS credentials. I re-logged in to AWS and now everything is OK.
It seems the AWS login had expired.
The reason I wasn’t aware of it may be the error message. I remember that when I ran dvc push, it once guided me to aws sso login.

Again, thanks for your time and patience. :wink: You really saved me.


Glad we figured it out! That error message may not be too helpful indeed… We’ll look into it.

May be related to get: does not ask for password anymore · Issue #5677 · iterative/dvc · GitHub.

@rancat one more Q, to make sure I understand what the problem was:

You mean from the source DVC repo, before dvc push? If so, didn’t push give an error that some files couldn’t be pushed to the S3 remote? Because in that case the get error message was correct, and what was misleading was not being informed that not all the data had been pushed.

Please lmk. Thanks