Copying a dvc repository

I have realized that a dvc repository cannot be backed up as I would any working directory, since links cannot be backed up (except by archiving).

I am using symlinks with cache.protected true.

Apparently, the way to back up a dvc repository is to set up another dvc repository for this purpose and transfer all changes there.

I have set up a backup dvc repository on a flash drive, as follows:

cd /path/to/mount/point/
mkdir backup
cd backup
git init
dvc init
dvc remote add -d mainremote /path/to/my/dvc/repository
git add -A
git commit
git clone --no-hardlinks /path/to/my/dvc/repository
dvc pull

What happened was that git clone did not copy any symlinks, and dvc pull reported:

WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:

And here it listed all my data files and directories, some of them twice.

After that a few more error messages followed, saying that the cache files could not be downloaded since they do not exist on the flash repository.

Please help.

I have realized that the file system on my flash drive is msdos and it does not support symlinks at all.

Is there any good way of backing up my dvc repository to a removable drive?

I can set up incremental tar backup, although I do not like this option.

If I get a removable drive that supports symlinks, is there any good way to back up my dvc repository to it?

Hi @byoussin !

Indeed, if you simply copy a full dvc repo that is set up to use symlinks to a FAT partition, you will run into errors, as FAT doesn’t support symlinks. It would work when copying to a fs that supports symlinks though :slight_smile: But I think there is a better way to back it up anyway.

To back up your dvc repo, you need to back up two components: the git repository (you don’t have to worry about it if you already have it on github) and the dvc cache (usually in .dvc/cache, unless you’ve changed it to point to another dir). The git repository tracks .dvc files, which hold the metadata that tells dvc which cache files to link where in your workspace. So something like git clone + cp -r path/to/repo/.dvc/cache .dvc/cache would be an alternative to copying the whole repo when working with FAT.

Another way would be to set up your FAT flash drive as a dvc remote with dvc remote add -d mybackup /path/to/my/fat/flash and run git push && dvc push to back up data for the current workspace (i.e. the current commit).
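For reference, the remote-based option could look roughly like the shell sketch below. The function names, the `run` dry-run helper and the `DRY_RUN` variable are made up for illustration, and the paths passed in are placeholders. Committing `.dvc/config` after adding the remote is an assumption about how you might want to record the setup, and `git push` assumes a git remote (e.g. github) is already configured.

```shell
#!/bin/sh
# Sketch only: all function names here are hypothetical; "mybackup" is the
# remote name used in the post above.

# Print each command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# One-time setup: register a directory on the flash drive as the default
# dvc remote. $1 = project directory, $2 = directory on the flash drive.
setup_backup_remote() {
    ( run cd "$1" &&
      run dvc remote add -d mybackup "$2" &&
      run git add .dvc/config &&
      run git commit -m "Add mybackup dvc remote" )
}

# Each backup: push git history, then the cache files used by the
# current commit. $1 = project directory.
push_backup() {
    ( run cd "$1" &&
      run git push &&
      run dvc push -r mybackup )
}
```

Calling `setup_backup_remote` once and then `push_backup` after each commit would mirror the `git push && dvc push` step; setting `DRY_RUN=1` first only prints the commands instead of running them.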

@kupruser Thanks!
I have found another removable drive that supports symlinks and am backing up my repo by copying it with rsync; this is the simplest solution.
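A minimal sketch of such an rsync copy, assuming the destination file system supports symlinks. The function name `mirror_repo`, the `run` helper and `DRY_RUN` are illustrative names, not part of rsync or dvc, and the paths are placeholders.

```shell
#!/bin/sh
# Sketch only: mirror_repo and run are hypothetical helper names.

# Print each command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Copy the whole project directory, including .git, .dvc and the
# workspace symlinks. $1 = project directory, $2 = destination directory.
# -a (archive mode) keeps symlinks as symlinks and preserves permissions
# and timestamps. The trailing slash on the source means "copy the
# contents of this directory" rather than the directory itself.
mirror_repo() {
    run rsync -a "$1/" "$2/"
}
```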

As for your other solution,

So something like git clone + cp -r path/to/repo/.dvc/cache .dvc/cache would be an alternative to copying the whole repo when working with FAT.

I would not know how to restore from such a backup. I suggest that, for the benefit of other users, you post the directions somewhere.

As for your third solution,

Another way would be to set up your FAT flash drive as a dvc remote with dvc remote add -d mybackup /path/to/my/fat/flash and run git push && dvc push to back up data for the current workspace (i.e. the current commit).

I think I tried to do something very close: instead of dvc push from my home repository I tried dvc pull from the backup repository (see my script above), and it did not work.

I have found another removable drive that supports symlinks and am backing up my repo by copying it with rsync; this is the simplest solution.

Glad to hear it works for you :slight_smile:

I would not know how to restore from such a backup. I suggest that, for the benefit of other users, you post the directions somewhere.

Just git clone and then copy .dvc/cache in place and run dvc checkout, as simple as that.
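The backup and restore steps for the git clone + cache copy approach might be sketched like this. It assumes the cache lives in the default .dvc/cache location; the function names, the `run` helper and `DRY_RUN` are hypothetical, and both functions only handle the first run, since git clone refuses to write into an existing non-empty directory.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, and the cache is assumed
# to be in the default .dvc/cache directory.

# Print each command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# First backup: clone the git history, then copy the dvc cache next to it.
# $1 = original project, $2 = backup directory (must not exist yet).
backup_clone_and_cache() {
    run git clone "$1" "$2" &&
    run mkdir -p "$2/.dvc/cache" &&
    run cp -R "$1/.dvc/cache/." "$2/.dvc/cache/"
}

# Restore: clone from the backup, put the cache in place, then let
# dvc checkout recreate the workspace files from the cache.
# $1 = backup directory, $2 = new project directory (must not exist yet).
restore_from_backup() {
    run git clone "$1" "$2" &&
    run mkdir -p "$2/.dvc/cache" &&
    run cp -R "$1/.dvc/cache/." "$2/.dvc/cache/" &&
    ( run cd "$2" && run dvc checkout )
}
```

Copying `cache/.` into an existing `cache/` directory, rather than copying the directory itself, avoids ending up with a nested `cache/cache` if the target already exists.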

Backup consists of backing up the git repo and the dvc remote, and doesn’t differ much from the way you would share your dvc project with other people, which is described in https://dvc.org/doc/use-cases/share-data-and-model-files . Your experience was not very good because you tried to copy the repo as is onto a FAT flash drive, which doesn’t support symlinks. Please feel free to create an issue on https://github.com/iterative/dvc.org/issues describing your experience and your thoughts about a backup guide, and maybe even consider contributing a small article about it :slightly_smiling_face:

I think I tried to do something very close: instead of dvc push from my home repository I tried dvc pull from the backup repository (see my script above), and it did not work.

Your script above does some very weird things, like creating a new wrapper git/dvc repository and then cloning the original one into it. The approach that I’ve suggested should work.

Let us know if you have any questions :slight_smile:
