Can't go to former version of dataset with `dvc checkout`

Hi,

I am trying out DVC for the first time for my deep learning development pipeline. As described in the Version Data with DVC tutorial, I created a folder named “dataset” in the main directory and added some data to it. I then ran dvc add dataset to track the dataset. After that, I added the “dataset” folder to .gitignore so that git would not track it.

Then I set up Azure Blob Storage as my remote data repo. After adding some more data, I wanted to go back to the original version of the data, so I checked out that git commit and then ran dvc checkout. This error came up:

+--------------------------------------------------+
|                                                  |
|        Update available 1.10.2 -> 1.11.10        |
|     Run `apt-get install --only-upgrade dvc`     |
|                                                  |
+--------------------------------------------------+

WARNING: Cache 'HashInfo(name='md5', value='db4287e726604a63a231cf4462cb27df.dir', dir_info=None, size=497535926, nfiles=8818)' not found. File 'dataset' won't be created.
ERROR: Checkout failed for following targets:
dataset
Is your cache up to date?
<https://error.dvc.org/missing-files>

I tried searching for a solution at the given link but could not find one. Thanks in advance.

@htutlynn
Can you post the output of the dvc doctor command? The output of dvc checkout -v could also be helpful.

I have a question: are you checking out your work on the same machine, or are you working on a fresh clone of your repo?

Hi @Paffciu,

Here is the output of the dvc doctor command.

Platform: Python 3.7.9 on Linux-5.4.0-58-generic-x86_64-with-debian-buster-sid
Supports: All remotes
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sdb5
Caches: local
Remotes: azure
Workspace directory: ext4 on /dev/sdb5
Repo: dvc, git

Output of dvc checkout -v at the initial dataset git commit:

2021-01-12 18:07:00,403 DEBUG: Assuming '/home/htut/Desktop/dvc_test/.dvc/cache/98/dfe29d5df6ee196c81f550443f8801' is unchanged since it is read-only
2021-01-12 18:07:00,403 DEBUG: Assuming '/home/htut/Desktop/dvc_test/.dvc/cache/22/dfe20acfd3c48dd0f4203d9f45bbba' is unchanged since it is read-only
2021-01-12 18:07:00,403 DEBUG: Assuming '/home/htut/Desktop/dvc_test/.dvc/cache/63/914ab4bdeef68cb5dcc2603e149792' is unchanged since it is read-only
2021-01-12 18:07:00,403 DEBUG: Assuming '/home/htut/Desktop/dvc_test/.dvc/cache/1b/45c831a48780b67dfe7d89a5b852d9' is unchanged since it is read-only
2021-01-12 18:07:00,403 DEBUG: Assuming '/home/htut/Desktop/dvc_test/.dvc/cache/02/2633c10858e12b1698397918bea4fb' is unchanged since it is read-only
2021-01-12 18:07:00,403 DEBUG: Assuming '/home/htut/Desktop/dvc_test/.dvc/cache/61/b573551adad906948553ea52acdeff' is unchanged since it is read-only
2021-01-12 18:07:00,403 DEBUG: Assuming '/home/htut/Desktop/dvc_test/.dvc/cache/20/24c6d77f07506dce96ab73fc20c3c4' is unchanged since it is read-only
2021-01-12 18:07:00,404 DEBUG: Assuming '/home/htut/Desktop/dvc_test/.dvc/cache/6d/3a6e50e8ea21a7535be7371ed5139a' is unchanged since it is read-only
2021-01-12 18:07:00,404 DEBUG: Assuming '/home/htut/Desktop/dvc_test/.dvc/cache/17/2cbcfbfa657c5d4ea41fe0f0b064b9' is unchanged since it is read-only
2021-01-12 18:07:00,404 DEBUG: Assuming '/home/htut/Desktop/dvc_test/.dvc/cache/4f/07c0739eafae00f1f34ea0a41b5851' is unchanged since it is read-only
2021-01-12 18:07:00,404 DEBUG: Assuming '/home/htut/Desktop/dvc_test/.dvc/cache/8f/1a65af83f9a8374bd27509a5f6f662' is unchanged since it is read-only
2021-01-12 18:07:00,404 DEBUG: Removing '/home/htut/Desktop/dvc_test/dataset'
2021-01-12 18:07:00,599 DEBUG: fetched: [(37086,)]
2021-01-12 18:07:00,698 ERROR: Checkout failed for following targets:
dataset
Is your cache up to date?
<https://error.dvc.org/missing-files>
------------------------------------------------------------
Traceback (most recent call last):
  File "dvc/main.py", line 90, in main
  File "dvc/command/checkout.py", line 65, in run
  File "dvc/command/checkout.py", line 49, in run
  File "dvc/repo/__init__.py", line 60, in wrapper
  File "dvc/repo/checkout.py", line 108, in checkout
dvc.exceptions.CheckoutError: Checkout failed for following targets:
dataset
Is your cache up to date?
<https://error.dvc.org/missing-files>
------------------------------------------------------------
2021-01-12 18:07:00,699 DEBUG: Analytics is disabled.

@Paffciu, I am checking out my work on the same machine.

@htutlynn
And the project path is the same?
If it is, can you verify that {project_path}/.dvc/cache/db/4287e726604a63a231cf4462cb27df.dir does not exist?
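
For readers following along: the path above comes straight from the hash in the warning. Judging by this path and the cache paths in the verbose log, the first two characters of the md5 form a subdirectory of .dvc/cache and the remaining characters (including the .dir suffix) form the file name. A minimal sketch of that mapping, assuming this layout; dvc_cache_path is a hypothetical helper name, not a DVC command:

```shell
#!/bin/sh
# Hypothetical helper: map a DVC md5 hash (optionally ending in .dir) to
# its expected location in the local cache. Assumes the layout seen in
# this thread: .dvc/cache/<first two chars>/<remaining chars>.
dvc_cache_path() {
    hash=$1
    prefix=$(printf '%s' "$hash" | cut -c1-2)
    rest=$(printf '%s' "$hash" | cut -c3-)
    printf '.dvc/cache/%s/%s\n' "$prefix" "$rest"
}

# Check whether the cache entry named in the warning exists.
# Run this from the project root.
target=$(dvc_cache_path "db4287e726604a63a231cf4462cb27df.dir")
if [ -e "$target" ]; then
    echo "found: $target"
else
    echo "missing: $target"
fi
```

If the entry is reported as missing, the data for that revision is not in the local cache, which matches the "Cache ... not found" warning.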

@Paffciu, I tried looking for .dir files but could not find any.

Btw, after adding the dataset dir with dvc add dataset, I manually added it to the project’s main .gitignore file before git commit, instead of running the command that was suggested after dvc add dataset. I am not sure, but I suspect that might be the reason.

@htutlynn it looks like the dvc add command has failed.
Would it be possible for you to retry adding the dataset?
The simplest way to verify that everything is all right would be:

  1. dvc add {data}
  2. rm {data}
  3. dvc checkout

Adding {data} to .gitignore manually should not influence the cache.
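
The three verification steps above can be sketched as a small shell function. This is only an illustration under stated assumptions: verify_roundtrip is a hypothetical helper name, it must be run from the root of a DVC repository, and it stops before deleting anything if dvc add fails.

```shell
#!/bin/sh
# Sketch of the add / remove / checkout round trip described above.
# verify_roundtrip is a hypothetical name, not part of DVC.
verify_roundtrip() {
    target=$1
    # 1. Track the data; this should also copy or link it into .dvc/cache.
    dvc add "$target" || return 1
    # 2. Delete the workspace copy, so only the cached copy remains.
    rm -rf -- "$target"
    # 3. Restore the workspace from the cache.
    dvc checkout || return 1
    # The round trip succeeded if the data is back in the workspace.
    [ -e "$target" ]
}
```

Calling verify_roundtrip dataset returns a zero exit status only if the folder was restored from the cache.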

@Paffciu, I just did the steps all over again: I re-added the dataset with dvc add dataset and then deleted the dataset folder with rm -rf dataset. When I run dvc checkout on the latest commit, it works as expected, but when I go to the initial dataset commit and run dvc checkout, it doesn’t work.

@htutlynn
I understand. Was it done in your original repo? If so, that does not change much, because the initial version of the data already had problems when it was added.

Would it be possible to redo all the steps from scratch, like this?

  1. create a new repo
  2. add the initial version of the data
  3. rm -rf data
  4. dvc checkout

Hi @Paffciu, sorry for the late reply.
Following the steps you described, I created everything from scratch and now it works exactly as expected.
As for the error, I think manually adding the dataset folder to the project’s main .gitignore caused it. I skipped that step this time and now it works as expected without any errors.

Thanks for the help.

@Paffciu, however, when I go back to a former version of the dataset with git and run dvc checkout in a repo cloned on a different machine, the error still persists.

@htutlynn did you run dvc fetch on this particular revision? When working with a fresh clone of the repo, you need to update the local cache by running either dvc pull or dvc fetch.

You can simply think of dvc pull as dvc fetch && dvc checkout.

@Paffciu, after cloning a fresh repo from GitHub, I tried both approaches, dvc fetch and dvc pull. After that, I went back to a former version of the dataset and ran ‘dvc checkout’, but the error is still the same.

However, if I go back to a former dataset commit after cloning the fresh repo, without immediately running dvc pull or dvc fetch, and then run dvc checkout on that commit, it works without the cache-not-found error. I am not sure whether it is supposed to work that way.

@htutlynn that is because the cache is downloaded only for the project version that is currently checked out in the repository. So, to download the cache for an older data version, you first need to git checkout its revision and only then run dvc pull.
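
The order of operations described above can be sketched as follows. This is a minimal illustration, not an official recipe: restore_data_at is a hypothetical helper name, and the revision passed to it stands for whatever commit or tag holds the older dataset version.

```shell
#!/bin/sh
# Sketch: restore the data belonging to an older git revision in a
# fresh clone. restore_data_at is a hypothetical name, not part of DVC.
restore_data_at() {
    rev=$1
    # First switch the .dvc files to the older revision ...
    git checkout "$rev" || return 1
    # ... then download the cache objects that revision refers to and
    # place them in the workspace (dvc fetch followed by dvc checkout).
    dvc pull
}
```

Running dvc pull before git checkout would only download the cache for the revision that was checked out at that moment, which is why the older version then appears to be missing.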

@Paffciu, oh, that makes everything clear. Thanks for the help!
