Retrieve data after using dvc import, but deleting original git repo

We ran into a problem: we had our original data in a Git repo that was connected to a bucket on Google Cloud Storage. We then imported the data into a new repo with dvc import. We just realized that we accidentally deleted the original Git repo at some point. The data is still in our bucket, but we no longer have the original repo’s .dvc files that point to it. We do have the new repo’s files, but running dvc pull now gives this error:

ERROR: failed to fetch data for 'split' - Failed to clone repo '/mnt/myfolder' to '/tmp/tmp_m8bnoq_dvc-repo': Cmd('git') failed due to: exit code(128)
  cmdline: git clone --no-single-branch -v /mnt/myfolder /tmp/tmp_m8bnoq_dvc-repo
  stderr: 'fatal: repository '/mnt/myfolder' does not exist
'

Can we do anything now to retrieve the data?


@mroutis helped me on the DVC Discord chat with this fantastic answer:

I’ll create a temporary directory to simulate your Google Cloud Storage bucket, and then set up two directories to simulate your old and new repositories:

bucket=$(mktemp -d)
old_repo=$(mktemp -d)
new_repo=$(mktemp -d)

This would be your old repository:

cd ${old_repo}

git init
dvc init

dvc remote add --default bucket ${bucket}

echo "foo" > foo
dvc add foo
dvc push

git add -A
git commit -m "foo is in the house"

At this point, you’ll have an entry like d3/b07384d113edec49eaa6238ad5ff00 in both ${old_repo}/.dvc/cache and ${bucket}.
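The entry name is derived from the file's MD5 checksum: the first two hex characters become the directory and the remainder becomes the file name. Here is a minimal sketch to check that for the foo file above, assuming md5sum is available; the variable names are only illustrative, and it reflects the cache layout shown in this post:

```shell
# Compute the MD5 of the same content that `echo "foo" > foo` writes.
hash=$(printf 'foo\n' | md5sum | cut -d' ' -f1)

# Split it the way the cache entry is laid out: 2 characters, then the rest.
prefix=$(printf '%s' "$hash" | cut -c1-2)
rest=$(printf '%s' "$hash" | cut -c3-)
cache_entry="${prefix}/${rest}"

echo "$cache_entry"
```

This should print the same d3/b07384... entry mentioned above, which is why the data stays addressable in the bucket even without the old repo.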

Now, let’s work on the new repository and import foo from the old one:

cd ${new_repo}

dvc init --no-scm
dvc import ${old_repo} foo

Your files should look like this:

${bucket}
└── d3
   └── b07384d113edec49eaa6238ad5ff00

${old_repo}
├── .dvc
│  └── cache
│     └── d3
│        └── b07384d113edec49eaa6238ad5ff00
├── foo
└── foo.dvc

${new_repo}
├── .dvc
│  └── cache
│     └── d3
│        └── b07384d113edec49eaa6238ad5ff00
├── foo
└── foo.dvc

Notice how the cache is distributed across the repositories and the bucket. Now, let’s remove ${old_repo}, the cache of our new repository, and the data:

rm -rf ${old_repo} .dvc/cache foo

At this point, you are in the same boat as before: you accidentally removed the previous repository that was used as a “pointer”, and you have no data in your workspace, just the foo.dvc file. If you try to dvc pull, you’ll see the following error message:

ERROR: failed to fetch data for 'foo' - Failed to clone repo '${old_repo}' to '/tmp/tmpo1_4i3kudvc-erepo': Cmd('git') failed due to: exit code(128)
  cmdline: git clone --no-single-branch -v ${old_repo} /tmp/tmpo1_4i3kudvc-erepo
  stderr: 'fatal: repository '${old_repo}' does not exist
'

My suggestion is to edit foo.dvc, remove the dependency on ${old_repo}, and keep just the output, so that the file looks like this:

outs:
- md5: d3b07384d113edec49eaa6238ad5ff00
  path: foo
  cache: true
  metric: false
  persist: false
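If you prefer to make that edit from the command line, one option is to back up the stage file and overwrite it with only the outs section. This is just a sketch under a few assumptions: the file is named foo.dvc, it lives in ${new_repo} (the script falls back to a scratch directory if that variable is unset), and the .bak name is an arbitrary choice. The fields are copied from the listing above, so use your own md5 and path for real data:

```shell
# Work in the new repository; fall back to a scratch directory if it is unset.
workdir="${new_repo:-$(mktemp -d)}"
cd "$workdir"

# Keep a copy of the original import stage file, if there is one.
if [ -f foo.dvc ]; then
    cp foo.dvc foo.dvc.bak
fi

# Write a stage file that lists only the output, with no link to the old repo.
cat > foo.dvc <<'EOF'
outs:
- md5: d3b07384d113edec49eaa6238ad5ff00
  path: foo
  cache: true
  metric: false
  persist: false
EOF
```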

Now, you only need to set up the same remote as the old one:

dvc remote add --default bucket ${bucket}

After that, dvc pull should work correctly, and you should be able to retrieve your files from Google Cloud Storage.
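In the real case the remote is a gs:// URL rather than a local directory. Here is a rough sketch, assuming a bucket called my-bucket with the DVC data under dvc-store. Both names are hypothetical placeholders, so substitute whatever the old repo's remote pointed at. It first prints where the object for foo is expected to live, following the bucket layout shown earlier, and only runs the DVC commands when dvc is installed and the current directory is a DVC repo:

```shell
# Hypothetical remote location; replace with the old repo's real remote URL.
remote_url="gs://my-bucket/dvc-store"

# Checksum taken from the outs section of foo.dvc.
md5="d3b07384d113edec49eaa6238ad5ff00"

# Expected object location, mirroring the <2 chars>/<rest> layout of the bucket.
object_url="${remote_url}/$(printf '%s' "$md5" | cut -c1-2)/$(printf '%s' "$md5" | cut -c3-)"
echo "expecting object at: ${object_url}"

# Only touch DVC when it is available and we are inside a DVC repository.
if command -v dvc >/dev/null 2>&1 && [ -d .dvc ]; then
    dvc remote add --default bucket "$remote_url"
    dvc pull
fi
```

If the pull reports missing files, checking whether that object actually exists in the bucket is a reasonable first step.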


Thanks @kaleidoescape for sharing this!