`dvc gc` does not remove files under dir.unpacked

Setup is dvc 0.75.0 with the .deb package under Ubuntu 18.04.
Also tested on Windows 10 with the .exe package.
I tried different configurations for the cache type (default, copy and symlink) and got similar behavior.

I’m testing DVC and trying to understand how it manages the datasets in its cache.
I initialize an empty repository with dvc init and add data with dvc add data, where data is a directory that contains two datasets:

  • data/data.json of size 240M
  • data/data_1.json of size 65M

I then run a script that produces an output dataset: dvc run -f prepare_data.dvc -d src/prepare_data.py data. It creates: output/prepared_data.npy of size 360M.
The .dvc/cache directory now contains the following files (names are simplified):

  • 06/file1 of size 240M at inode 60294316
  • 14/file2 of size 360M at inode 60949288
  • 20/file3.dir at inode 60294287
  • 59/file4 of size 65M at inode 60294327
  • dd/file5.dir at inode 60949289

I commit to git and tag it as expe1. I now modify my script and run dvc repro prepare_data.dvc. It produces a new file: output/prepared_data.npy of size 500M. The .dvc/cache directory now contains the following files:

  • 06/file1 of size 240M at inode 60294316
  • 14/file2 of size 360M at inode 60949288
  • 19/file6 of size 500M at inode 60949290
  • 20/file3.dir at inode 60294287
  • 20/file3.dir.unpacked/data.json of size 240M at inode 60294316
  • 20/file3.dir.unpacked/data_1.json of size 65M at inode 60294327
  • 22/file7.dir at inode 60294291
  • 59/file4 of size 65M at inode 60294327
  • dd/file5.dir at inode 60949289
  • dd/file5.dir.unpacked/prepared_data.npy of size 360M at inode 60949288

I commit and tag it as expe2. I now want to clean the cache to remove previous outputs from expe1 and run dvc gc. The .dvc/cache directory now contains the following files:

  • 06/file1 of size 240M at inode 60294316
  • empty directory 14
  • 19/file6 of size 500M at inode 60949290
  • 20/file3.dir at inode 60294287
  • 20/file3.dir.unpacked/data.json of size 240M at inode 60294316
  • 20/file3.dir.unpacked/data_1.json of size 65M at inode 60294327
  • 22/file7.dir at inode 60294291
  • 59/file4 of size 65M at inode 60294327
  • dd/file5.dir at inode 60949289
  • dd/file5.dir.unpacked/prepared_data.npy of size 360M at inode 60949288

So, contrary to what I expected, the previous version of the output file is still present in the cache at dd/file5.dir.unpacked/prepared_data.npy. Is this the expected behavior? How can I properly clean the cache?
It seems that this file is useless: if I try to check out expe1, it raises an error, and I have to reproduce the experiment anyway.
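
The inode numbers in the listings above are what reveal the duplicates: 14/file2 and dd/file5.dir.unpacked/prepared_data.npy are the same inode. A minimal sketch for spotting such entries, assuming a POSIX shell with find, ls and sort; the function name list_shared_inodes is hypothetical and not part of DVC:

```shell
#!/bin/sh
# Hypothetical helper (not a DVC command): print "inode path" for every
# regular file under the cache that has more than one hard link, sorted by
# inode so that links to the same data end up on adjacent lines.
# Usage: list_shared_inodes .dvc/cache
list_shared_inodes() {
    cache_dir="${1:-.dvc/cache}"   # default path taken from the post above
    find "$cache_dir" -type f -links +1 -exec ls -i {} + | sort -n
}
```

Note that a file whose other link was already deleted has a link count of 1 again, so it will not show up here even though it still occupies space.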


Hello! Let me try to reproduce this with some dummy files and scripts in a console, as shown below.

Note that your dvc run command is incomplete: it is missing the expected output (-o) as well as the actual command to run.

$ dvc version
DVC version: 0.75.0
Python version: 3.7.5
Platform: Darwin-19.0.0-x86_64-i386-64bit
Binary: False
Package: brew

$ mkdir test && cd test
$ git init && dvc init
$ git commit -m "dvc init"

$ mkdir data
$ head -c 25165820 /dev/urandom > data/data  # 24 MB file
$ head -c 6815744 /dev/urandom > data/data_1 # 6.5 MB file
$ dvc add data
100% Add
...
$ git add .gitignore data.dvc
$ git commit -m "dvc add data"

Let’s do the first exp:

$ mkdir src
$ nano src/prepare_data.sh # edit script to have:
#!/bin/sh
mkdir output
head -c 37748740 /dev/urandom > output/prepared_data # 36 MB file
$ chmod +x src/prepare_data.sh
$ dvc run -f prepare_data.dvc -d src/prepare_data.sh src/prepare_data.sh
Running command:
        src/prepare_data.sh
...
$ ls -l -R .dvc/cache
...
.dvc/cache/0b:
total 73736
-rw-r--r--  1 usr  staff  36M Dec 12 11:37 458e7ecd66d81d151680e6285e6d98
.dvc/cache/31:
total 8
-rw-r--r--  1 usr  staff  130B Dec 12 11:26 f2a65f9b5d39febf311e190ebf604d.dir
.dvc/cache/55:
total 13312
-rw-r--r--  1 usr  staff  6.5M Dec 12 11:20 a21b3070164f77f400b9e9614f032a
.dvc/cache/e2:
total 49152
-rw-r--r--  1 usr  staff  24M Dec 12 11:20 3e374c81347968b22f365c465fa11b
$ git add src prepare_data.dvc output/.gitignore
$ git commit -m "dvc run src/prepare_data.sh"
$ git tag -a "expe1" -m "first experiment"

As you can see, I only have one .dir in my cache because I used -o in dvc run, which adds a .gitignore file to output/ so I don’t have to add the whole output dir manually. That’s a difference to keep in mind, but it’s not closely related to your question, so let’s keep going.

$ nano src/prepare_data.sh # edit script to:
#!/bin/sh
mkdir output
head -c 52428800 /dev/urandom > output/prepared_data # 50 MB file
$ dvc repro prepare_data.dvc
WARNING: Dependency 'src/prepare_data.sh' of 'prepare_data.dvc' changed because it is 'modified'.
WARNING: Stage 'prepare_data.dvc' changed.
Running command:
        src/prepare_data.sh
# This creates 50M file .dvc/cache/83/c523fcd...
$ git add src prepare_data.dvc
$ git commit -m "dvc repro prepare_data.dvc"
$ git tag -a "expe2" -m "second experiment"

I now want to clean the cache to remove previous outputs from expe1 and run dvc gc

That’s the correct approach, as described in https://dvc.org/doc/command-reference/gc :

This command deletes (garbage collects) data files or directories that may exist in the cache … but no longer referenced in DVC-files currently in the workspace.

$ dvc gc
WARNING: This will remove all cache except items used in the working tree of the current repo.
Are you sure you want to proceed? [y/n] y
$ ls -lh -R .dvc/cache
...
.dvc/cache/0b:
.dvc/cache/31:
total 8
-rw-r--r--  1 jop  staff   130B Dec 12 11:26 f2a65f9b5d39febf311e190ebf604d.dir
.dvc/cache/55:
total 13312
-rw-r--r--  1 jop  staff   6.5M Dec 12 11:20 a21b3070164f77f400b9e9614f032a
.dvc/cache/83:
total 102400
-rw-r--r--  1 jop  staff    50M Dec 12 11:58 c523fcdbcba3c3cb60dec7b4ea4b31
.dvc/cache/e2:
total 49152
-rw-r--r--  1 jop  staff    24M Dec 12 11:20 3e374c81347968b22f365c465fa11b 

As you can see, .dvc/cache/0b, which used to hold the 36 MB file, is now empty. Also, I don’t get any of the .dir.unpacked/ dirs. So I can’t reproduce the problem you are reporting. If you could send exact commands in this same fashion, maybe we can find the source of the issue. I have a feeling it has to do with your use of dvc run (not specifying the output file with -o).

It seems that this file is useless: if I try to check out expe1, it raises an error, and I have to reproduce the experiment anyway.

What error do you get? From Git or from DVC? Thanks


P.S. Thinking about this again, I see that I may have missed the main point, which is specifically about the “unpacked” directories. You may indeed have found a bug in dvc gc. Mind opening an issue in our GitHub repo? This way the engineering team will look into it as well. https://github.com/iterative/dvc/issues


Thanks for the answer. I correct myself: the command I used to run dvc is indeed using -o:
dvc run -f prepare_data.dvc -d src/prepare_data.py -d data -o output python src/prepare_data.py data
As you mentioned, it has to do with the unpacked directories that get created. They use hardlinks to files already present in the cache (even if the cache type is copy or symlink). dvc gc removes one link to the file, but not the other one under the unpacked directory, so the data stays on disk.
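
To illustrate why the space is not freed, here is a small sketch of the hard link behavior in a throwaway temp directory. It only mimics the cache layout from my listing; the function name demo_hardlink_gc is hypothetical and no dvc command is involved:

```shell
#!/bin/sh
# Hypothetical demo: mimic a cache entry plus its hard-linked twin under a
# .dir.unpacked directory, then delete only the main entry.
demo_hardlink_gc() {
    work=$(mktemp -d)
    mkdir -p "$work/14" "$work/dd/file5.dir.unpacked"
    printf 'prepared data\n' > "$work/14/file2"
    # Second name for the same inode, like the unpacked copy in the cache.
    ln "$work/14/file2" "$work/dd/file5.dir.unpacked/prepared_data.npy"
    # Remove the main entry only (what gc appears to do, per this thread).
    rm "$work/14/file2"
    # The content is still reachable, so its disk blocks were not released.
    cat "$work/dd/file5.dir.unpacked/prepared_data.npy"
    rm -rf "$work"
}
```

Calling demo_hardlink_gc prints the file content even though the first name was removed, because a file's data is only released once its last link is gone.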


Also, if I run git checkout expe1 followed by dvc checkout after dvc gc, the error raised is:
ERROR: unexpected error - Checkout failed for the following target: output Did you forget to fetch ?
Fetching does not help because the project has no remote storage.

I think this is because dvc gc does delete the main file from cache and assumes the one in unpacked dir isn’t available either.
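
Until the issue is resolved, it may help to at least see how much space those directories hold. A read-only sketch, assuming a POSIX shell with find and du; the function name list_unpacked_dirs is hypothetical, and whether it is safe to delete these directories by hand is an assumption that is not confirmed in this thread:

```shell
#!/bin/sh
# Hypothetical helper (not a DVC command): print the disk usage in KiB and
# the path of every *.dir.unpacked directory under the cache.
# It only reports; nothing is deleted.
# Usage: list_unpacked_dirs .dvc/cache
list_unpacked_dirs() {
    cache_dir="${1:-.dvc/cache}"
    # -prune stops find from descending into a matched directory.
    find "$cache_dir" -type d -name '*.dir.unpacked' -prune -exec du -sk {} \;
}
```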

Anyway, thanks for opening https://github.com/iterative/dvc/issues/2946 to address all this!
