I am trying out DVC for the first time for my deep learning development pipeline. Following the Version Data with DVC tutorial, I created a folder named “dataset” in the main directory and added some data to it. I then ran dvc add dataset to track the dataset. After that, I added the “dataset” folder to .gitignore so that Git no longer tracks it.
Then I set up an Azure Blob Storage container as my remote data repo. After adding some more data, I wanted to go back to the original version of the data, so I checked out that Git commit and then ran dvc checkout. That is when this error came up.
+--------------------------------------------------+
| |
| Update available 1.10.2 -> 1.11.10 |
| Run `apt-get install --only-upgrade dvc` |
| |
+--------------------------------------------------+
WARNING: Cache 'HashInfo(name='md5', value='db4287e726604a63a231cf4462cb27df.dir', dir_info=None, size=497535926, nfiles=8818)' not found. File 'dataset' won't be created.
ERROR: Checkout failed for following targets:
dataset
Is your cache up to date?
<https://error.dvc.org/missing-files>
I looked for a solution at the given link but couldn’t find one. Thanks in advance.
Platform: Python 3.7.9 on Linux-5.4.0-58-generic-x86_64-with-debian-buster-sid
Supports: All remotes
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sdb5
Caches: local
Remotes: azure
Workspace directory: ext4 on /dev/sdb5
Repo: dvc, git
Output of dvc checkout -v at the initial dataset Git commit:
2021-01-12 18:07:00,403 DEBUG: Assuming '/home/htut/Desktop/dvc_test/.dvc/cache/98/dfe29d5df6ee196c81f550443f8801' is unchanged since it is read-only
2021-01-12 18:07:00,403 DEBUG: Assuming '/home/htut/Desktop/dvc_test/.dvc/cache/22/dfe20acfd3c48dd0f4203d9f45bbba' is unchanged since it is read-only
2021-01-12 18:07:00,403 DEBUG: Assuming '/home/htut/Desktop/dvc_test/.dvc/cache/63/914ab4bdeef68cb5dcc2603e149792' is unchanged since it is read-only
2021-01-12 18:07:00,403 DEBUG: Assuming '/home/htut/Desktop/dvc_test/.dvc/cache/1b/45c831a48780b67dfe7d89a5b852d9' is unchanged since it is read-only
2021-01-12 18:07:00,403 DEBUG: Assuming '/home/htut/Desktop/dvc_test/.dvc/cache/02/2633c10858e12b1698397918bea4fb' is unchanged since it is read-only
2021-01-12 18:07:00,403 DEBUG: Assuming '/home/htut/Desktop/dvc_test/.dvc/cache/61/b573551adad906948553ea52acdeff' is unchanged since it is read-only
2021-01-12 18:07:00,403 DEBUG: Assuming '/home/htut/Desktop/dvc_test/.dvc/cache/20/24c6d77f07506dce96ab73fc20c3c4' is unchanged since it is read-only
2021-01-12 18:07:00,404 DEBUG: Assuming '/home/htut/Desktop/dvc_test/.dvc/cache/6d/3a6e50e8ea21a7535be7371ed5139a' is unchanged since it is read-only
2021-01-12 18:07:00,404 DEBUG: Assuming '/home/htut/Desktop/dvc_test/.dvc/cache/17/2cbcfbfa657c5d4ea41fe0f0b064b9' is unchanged since it is read-only
2021-01-12 18:07:00,404 DEBUG: Assuming '/home/htut/Desktop/dvc_test/.dvc/cache/4f/07c0739eafae00f1f34ea0a41b5851' is unchanged since it is read-only
2021-01-12 18:07:00,404 DEBUG: Assuming '/home/htut/Desktop/dvc_test/.dvc/cache/8f/1a65af83f9a8374bd27509a5f6f662' is unchanged since it is read-only
2021-01-12 18:07:00,404 DEBUG: Removing '/home/htut/Desktop/dvc_test/dataset'
2021-01-12 18:07:00,599 DEBUG: fetched: [(37086,)]
2021-01-12 18:07:00,698 ERROR: Checkout failed for following targets:
dataset
Is your cache up to date?
<https://error.dvc.org/missing-files>
------------------------------------------------------------
Traceback (most recent call last):
File "dvc/main.py", line 90, in main
File "dvc/command/checkout.py", line 65, in run
File "dvc/command/checkout.py", line 49, in run
File "dvc/repo/__init__.py", line 60, in wrapper
File "dvc/repo/checkout.py", line 108, in checkout
dvc.exceptions.CheckoutError: Checkout failed for following targets:
dataset
Is your cache up to date?
<https://error.dvc.org/missing-files>
------------------------------------------------------------
2021-01-12 18:07:00,699 DEBUG: Analytics is disabled.
@htutlynn
And the project path is the same?
If it is, can you verify that {project_path}/.dvc/cache/db/4287e726604a63a231cf4462cb27df.dir does not exist?
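To make that check easier to repeat, here is a minimal sketch of how a hash from the warning maps to a path in the local cache. It assumes the layout visible in the debug log in this thread (first two characters of the hash as a subdirectory, the rest as the file name). The function name dvc_cache_path is hypothetical and not part of DVC.

```shell
# Hypothetical helper (not a DVC command): print where a hash would live
# in the local cache, assuming the <first two chars>/<rest> layout.
# $1 = hash as printed by DVC (may end in ".dir")
# $2 = cache directory (optional, defaults to .dvc/cache)
dvc_cache_path() {
    hash="$1"
    cache_dir="${2:-.dvc/cache}"
    rest="${hash#??}"          # hash without its first two characters
    prefix="${hash%"$rest"}"   # just the first two characters
    printf '%s/%s/%s\n' "$cache_dir" "$prefix" "$rest"
}
```

From the project root, something like ls -l "$(dvc_cache_path db4287e726604a63a231cf4462cb27df.dir)" would then show whether the entry exists.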
By the way, after adding the dataset dir with dvc add dataset, I manually added the dataset dir to the project’s main .gitignore file before git commit, instead of running the command that was suggested after dvc add dataset. I am not sure, but I suspect that might be the reason.
@htutlynn it looks like the dvc add command has failed.
Would it be possible for you to retry adding the dataset?
The simplest verification that everything is all right would be:
dvc add {data}
rm {data}
dvc checkout
Adding {data} to .gitignore manually should not influence the cache.
@Paffciu, I just did the steps all over again. I re-added the dataset using dvc add dataset and then deleted the dataset folder with rm -rf dataset. When I ran dvc checkout on the latest commit, it worked as expected, but when I go to the initial dataset commit and try dvc checkout, it doesn’t work.
Hi @Paffciu, sorry for the late reply.
Following the steps that you described, I created everything from scratch and now it works exactly as expected.
As for the error, I think manually adding the dataset folder to the project’s main .gitignore caused it. I skipped that step this time and now it works as expected without any errors.
@Paffciu, however, when I clone the repo on a different machine, go back to a former version of the dataset with Git, and run dvc checkout there, the error still persists.
@htutlynn did you run dvc fetch on this particular revision? When working with a fresh repo clone, you need to update the local cache by running either dvc pull or dvc fetch.
You can simply think of dvc pull as dvc fetch && dvc checkout.
@Paffciu, after cloning a fresh repo from GitHub, I tried both approaches, dvc fetch and dvc pull. After that, I went back to a former version of the dataset and ran dvc checkout, but the error is still the same.
However, if I go back to a former dataset commit right after cloning the fresh repo, without first running dvc pull or dvc fetch, and then run dvc checkout on that commit, it works without the cache-not-found error. I am not sure if it is supposed to work that way or not.
@htutlynn that is because the cache is downloaded only for the project version that is currently checked out in the repository. So, to download the cache for an older data version, you first need to git checkout its revision and only then run dvc pull.
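The order described above can be sketched as a small shell function. This is only an illustration: the name dvc_restore_revision is hypothetical, and it assumes a default remote is already configured so that dvc pull works without arguments.

```shell
# Hypothetical helper (not a DVC command): restore the data that belongs
# to an older Git revision.
# $1 = Git revision (commit hash, tag, or branch)
dvc_restore_revision() {
    rev="$1"
    # Switch Git first, so the .dvc files describe the older data version.
    git checkout "$rev" || return 1
    # Then download and check out the data those .dvc files reference
    # (dvc pull behaves like dvc fetch followed by dvc checkout).
    dvc pull
}
```

It would be called as dvc_restore_revision <commit>, where <commit> is the revision that holds the dataset version you want.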