@mroutis helped me on the DVC Discord chat with this fantastic answer:
I’ll create a temporary directory to simulate your Google Cloud Storage bucket, and then set up two directories to simulate your old and new repositories:
bucket=$(mktemp -d)
old_repo=$(mktemp -d)
new_repo=$(mktemp -d)
This would be your old repository:
cd ${old_repo}
git init
dvc init
dvc remote add --default bucket ${bucket}
echo "foo" > foo
dvc add foo
dvc push
git add -A
git commit -m "foo is in the house"
At this point, you’ll have an entry like d3/b07384d113edec49eaa6238ad5ff00 in both ${old_repo}/.dvc/cache and ${bucket}.
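As a rough illustration of where that entry name comes from: in this layout, the cache path appears to be the MD5 of the file contents, with the first two hex characters used as a directory. The sketch below assumes GNU `md5sum` is available (on macOS the equivalent tool is `md5`), and the variable names are only for illustration.

```shell
# Hash the same content that `echo "foo" > foo` writes (including the trailing newline).
hash=$(printf 'foo\n' | md5sum | cut -d' ' -f1)

# Split the hash into a two-character directory and the remaining file name.
prefix=$(printf '%s' "${hash}" | cut -c1-2)
rest=$(printf '%s' "${hash}" | cut -c3-)
entry="${prefix}/${rest}"

echo "${entry}"   # prints d3/b07384d113edec49eaa6238ad5ff00
```

This is why the same path shows up under both the local cache and the remote: it depends only on the file contents, not on which repository added the file.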
Now, let’s work on the new repository and import foo from the old one:
cd ${new_repo}
dvc init --no-scm
dvc import ${old_repo} foo
Your files should look like this:
${bucket}
└── d3
    └── b07384d113edec49eaa6238ad5ff00
${old_repo}
├── .dvc
│   └── cache
│       └── d3
│           └── b07384d113edec49eaa6238ad5ff00
├── foo
└── foo.dvc
${new_repo}
├── .dvc
│   └── cache
│       └── d3
│           └── b07384d113edec49eaa6238ad5ff00
├── foo
└── foo.dvc
Notice how the cache is distributed across the repositories and the bucket. Now, let’s remove ${old_repo}, the cache of our new repository, and the data:
rm -rf ${old_repo} .dvc/cache foo
At this point, you are in the same boat as before: you accidentally removed the previous repository that was used as a “pointer” and have no data in your workspace, just the foo.dvc file. If you try to dvc pull, you’ll see the following error message:
ERROR: failed to fetch data for 'foo' - Failed to clone repo '${old_repo}' to '/tmp/tmpo1_4i3kudvc-erepo': Cmd('git') failed due to: exit code(128)
  cmdline: git clone --no-single-branch -v ${old_repo} /tmp/tmpo1_4i3kudvc-erepo
stderr: 'fatal: repository '${old_repo}' does not exist
'
My suggestion was to remove the dependency on ${old_repo} and keep just the output, so the foo.dvc file will look like this:
outs:
- md5: d3b07384d113edec49eaa6238ad5ff00
  path: foo
  cache: true
  metric: false
  persist: false
Now, you only need to set up the same remote as the old one:
dvc remote add --default bucket ${bucket}
After that, dvc pull should work correctly, and you should be able to retrieve your files from Google Cloud Storage.
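To confirm that the pulled file really matches what foo.dvc records, you could compare hashes by hand. The sketch below is only an illustration: it runs in a scratch directory with stand-in copies of foo and a trimmed foo.dvc (so it does not need DVC installed), and it assumes GNU `md5sum` and the `- md5: <hash>` line layout shown above.

```shell
# Work in a scratch directory so nothing in a real repository is touched.
workdir=$(mktemp -d)
cd "${workdir}"

# Stand-ins for the pulled file and the edited foo.dvc from the answer.
printf 'foo\n' > foo
cat > foo.dvc <<'EOF'
outs:
- md5: d3b07384d113edec49eaa6238ad5ff00
  path: foo
EOF

# Read the recorded hash from the "- md5: <hash>" line.
expected=$(awk '$1 == "-" && $2 == "md5:" { print $3 }' foo.dvc)

# Hash the file that is actually on disk.
actual=$(md5sum foo | cut -d' ' -f1)

if [ "${expected}" = "${actual}" ]; then
  result="match"
else
  result="mismatch"
fi
echo "${result}"
```

In a real repository you would skip the stand-in files and run only the last part from the directory that contains foo and foo.dvc.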