dvc add --external produces what’s known as an “external output”. These are neither pushed to nor pulled from remote storage. In general we rarely recommend add --external, as it’s meant for a very specific use case (when absolutely no other options are available). More details in https://dvc.org/doc/user-guide/managing-external-data
Thanks for your reply.
Let me explain my case in more detail.
I have a model repo which has data files and training scripts, and I want to “save” my data in a data repo with dvc. The data repo is shared with the whole team so that every team member can easily get the data with dvc get.
For this reason, I ran dvc add --external <my model repo data> in the data repo.
OK. TBH I’m a bit confused by the use of the term “repo” above.
So this one is not a DVC repo? Is it a plain Git repo that tracks/versions large files?
For now you may want to look at the data registry pattern. You’d add the data there first, and then reuse it in other places, whether in a DVC repo or not (e.g. with dvc import or dvc get, respectively).
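To make the registry workflow concrete, here is a minimal Python sketch that only assembles the commands involved; it does not run them. It assumes the data has already been copied into the registry repo, that a default DVC remote is configured there, and that dvc add writes its .gitignore next to the tracked path. The function names and the registry URL are hypothetical, for illustration only.

```python
import posixpath
import shlex


def registry_commands(data_path, message="Track new data with DVC"):
    """Commands to run inside the data registry once the data is in place."""
    dvc_file = data_path + ".dvc"
    # Assumption: dvc add creates a .gitignore in the same directory as the data
    gitignore = posixpath.join(posixpath.dirname(data_path), ".gitignore")
    return [
        ["dvc", "add", data_path],            # track the data with DVC
        ["git", "add", dvc_file, gitignore],  # version the small metadata files
        ["git", "commit", "-m", message],
        ["dvc", "push"],                      # upload the data to remote storage
        ["git", "push"],                      # publish the .dvc file for others
    ]


def consumer_command(registry_url, data_path):
    """Command a team member runs anywhere to download the data."""
    return ["dvc", "get", registry_url, data_path]


if __name__ == "__main__":
    for argv in registry_commands("path/to/new_data"):
        print(shlex.join(argv))
    # Hypothetical registry URL
    print(shlex.join(consumer_command(
        "https://example.com/team/data-registry.git", "path/to/new_data")))
```

The point of the ordering is that dvc get needs both halves: the .dvc file in Git and the data itself in remote storage.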
Here all ‘repos’ are Git repos.
The data repo is a Git repo containing only data (e.g., .dvc files). I think this is a DVC repo?
The model repo is a Git repo containing the model (including training scripts, etc.) along with the data. But the data is not tracked in the model repo because I added it to .gitignore. Instead, I want the DVC repo to track it.
So in the DVC repo I ran dvc add --external <my model repo data>. I don’t know why the pushed data is not reusable…
dvc push showed Everything is up to date
I can successfully download the data on my local Mac. But when I tried dvc get on a remote machine, I got Some of the cache files do not exist neither locally nor on remote.
Attaching the debug output in case it helps.
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/dvc/main.py", line 55, in main
    ret = cmd.do_run()
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/dvc/command/base.py", line 64, in do_run
    return self.run()
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/dvc/command/get.py", line 31, in run
    return self._get_file_from_repo()
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/dvc/command/get.py", line 37, in _get_file_from_repo
    Repo.get(
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/dvc/repo/get.py", line 55, in get
    repo.repo_fs.download(from_info, to_info, jobs=jobs)
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/dvc/fs/base.py", line 265, in download
    return self._download_dir(
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/dvc/fs/base.py", line 275, in _download_dir
    from_infos = list(self.walk_files(from_info, **kwargs))
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/dvc/fs/repo.py", line 368, in walk_files
    for root, _, files in self.walk(path_info, **kwargs):
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/dvc/fs/repo.py", line 355, in walk
    yield from dvc_fs.walk(
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/dvc/fs/dvc.py", line 198, in walk
    self._add_dir(trie, out, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/dvc/fs/dvc.py", line 143, in _add_dir
    self._fetch_dir(out, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/dvc/fs/dvc.py", line 140, in _fetch_dir
    raise FileNotFoundError
FileNotFoundError
Please share the full steps with exact commands so we can try to determine what went wrong. As it is now I need to assume many of the steps and for me it works locally. I don’t even need to dvc push to remote storage as dvc get copies directly from the cache for repos in the same file system.
Also, keep in mind the data registry should have the corresponding .dvc files committed to Git for dvc get to be able to access them from some other location. Thanks
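One way to verify that the .dvc file really is committed in the registry is to ask Git directly. The sketch below is only an illustration: is_committed is a hypothetical helper, it must be run from inside the registry's working copy, and the run parameter exists solely so the function can be exercised without Git installed. It relies on git cat-file -e, which exits with status 0 when the named object exists.

```python
import subprocess


def is_committed(path, rev="HEAD", run=subprocess.run):
    """Return True if `path` exists in the given Git revision."""
    # `git cat-file -e <rev>:<path>` exits 0 only when that object exists
    result = run(["git", "cat-file", "-e", f"{rev}:{path}"],
                 capture_output=True)
    return result.returncode == 0
```

For example, is_committed("path/to/new_data.dvc") returning False would mean the file was created by dvc add but never committed, so dvc get from another location could not see it.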
I guess you mv the new_data from the external location into the repo first, right? (previously /path/to/mydata, now path/to/new_data inside the DVC repo).
This would be path/to/new_data.dvc right? BTW please also git add the corresponding .gitignore file.
Again, this would be path/to/new_data, right? (2nd argument)
I can see that this error message is not very informative. Can you please share the full debug output? Not just the error message or the Python trace, but the full output of that get command, adding the -v flag.
One possibility is that the remote machine can’t connect to the S3 bucket. Does it have AWS CLI configured with the appropriate credentials? ~~For other ways to configure S3 auth please see remote modify~~
UPDATE: Actually I guess from the remote machine you would never run remote modify so probably only the default AWS config is available in that case… I’ll double check with the team
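A quick way to rule out the credentials question on the remote machine is to check whether the AWS CLI can currently authenticate. This is a rough sketch, not part of DVC: aws_credentials_ok is a hypothetical helper, and it assumes DVC on that machine picks up the same default AWS configuration as the CLI. It uses aws sts get-caller-identity, which fails when credentials are missing or expired; the run parameter is there only to allow testing without the AWS CLI.

```python
import subprocess


def aws_credentials_ok(run=subprocess.run):
    """Return True if the AWS CLI can authenticate with its current credentials."""
    try:
        # Fails (non-zero exit) when credentials are missing or expired
        result = run(["aws", "sts", "get-caller-identity"],
                     capture_output=True, text=True)
    except FileNotFoundError:
        # The AWS CLI is not installed or not on PATH
        return False
    return result.returncode == 0
```

If this returns False, logging in again (for instance with aws sso login, when SSO is in use) before retrying dvc get is a reasonable first step.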
You reminded me about the AWS credentials. I logged in to AWS again and now everything is OK.
It seems the AWS login had expired.
The reason I wasn’t aware of it may be the error message. I remember that when I ran dvc push, it once prompted me to run aws sso login.
Again, thanks for your time and patience. You really saved me.
@rancat one more Q, to make sure I understand what the problem was:
You mean from the source DVC repo, before dvc push? If so, didn’t push give an error that some files couldn’t be pushed to the S3 remote? Because in that case the get error message was correct, and what was misleading was not being informed that not all data had been pushed.