Setup is DVC 0.75.0, installed from the .deb package on Ubuntu 18.04.
Also tested on Windows 10 with the .exe package.
I tried different configurations for the cache type (default, copy and symlink) and got similar behavior.
I’m testing DVC and trying to understand how it manages the datasets in its cache.
I initialize an empty repository with dvc init and add data with dvc add data in a directory that contains two datasets:
- data/data.json of size 240M
- data/data_1.json of size 65M
I then run a script that produces an output dataset: dvc run -f prepare_data.dvc -d src/prepare_data.py data. It creates output/prepared_data.npy of size 360M.
The .dvc/cache directory now contains the following files (names are simplified):
- 06/file1 of size 240M at inode 60294316
- 14/file2 of size 360M at inode 60949288
- 20/file3.dir at inode 60294287
- 59/file4 of size 65M at inode 60294327
- dd/file5.dir at inode 60949289
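For reference, a listing like the one above (path, size, inode) could be produced with a short Python script. This is only a sketch: the function name list_cache is made up for illustration, and it assumes the cache lives at .dvc/cache relative to the current directory.

```python
import os


def list_cache(cache_dir):
    """Return a sorted list of (relative path, size in bytes, inode) for every file under cache_dir."""
    entries = []
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            # lstat does not follow symlinks, so each entry describes the file itself
            info = os.lstat(path)
            entries.append((os.path.relpath(path, cache_dir), info.st_size, info.st_ino))
    return sorted(entries)


if __name__ == "__main__":
    # Assumed location of the DVC cache, relative to the repository root
    for rel, size, inode in list_cache(os.path.join(".dvc", "cache")):
        print(f"{rel} of size {size / 1e6:.0f}M at inode {inode}")
```

Two entries that report the same inode are hard links to the same data on disk, which is why the inode column is useful here. Empty directories do not show up in this output, since only files are listed.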
I commit to git and tag it as expe1. I now modify my script and run dvc repro prepare_data.dvc. It produces a new file: output/prepared_data.npy of size 500M. The .dvc/cache directory now contains the following files:
- 06/file1 of size 240M at inode 60294316
- 14/file2 of size 360M at inode 60949288
- 19/file6 of size 500M at inode 60949290
- 20/file3.dir at inode 60294287
- 20/file3.dir.unpacked/data.json of size 240M at inode 60294316
- 20/file3.dir.unpacked/data_1.json of size 65M at inode 60294327
- 22/file7.dir at inode 60294291
- 59/file4 of size 65M at inode 60294327
- dd/file5.dir at inode 60949289
- dd/file5.dir.unpacked/prepared_data.npy of size 360M at inode 60949288
I commit and tag it as expe2. I now want to clean the cache to remove previous outputs from expe1 and run dvc gc. The .dvc/cache directory now contains the following files:
- 06/file1 of size 240M at inode 60294316
- empty directory 14
- 19/file6 of size 500M at inode 60949290
- 20/file3.dir at inode 60294287
- 20/file3.dir.unpacked/data.json of size 240M at inode 60294316
- 20/file3.dir.unpacked/data_1.json of size 65M at inode 60294327
- 22/file7.dir at inode 60294291
- 59/file4 of size 65M at inode 60294327
- dd/file5.dir at inode 60949289
- dd/file5.dir.unpacked/prepared_data.npy of size 360M at inode 60949288
Therefore, contrary to what I expected, the previous version of the output file is still present in the cache at dd/file5.dir.unpacked/prepared_data.npy. Is this the expected behavior? How can I properly clean the cache?
This file seems useless: if I try to check out expe1, it raises an error, and I have to reproduce the experiment anyway.
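To make the problem concrete, the inode numbers from the listing after dvc gc can be grouped to see which entries share data and how much space is really used. The sketch below is in Python and uses only the sizes and inodes quoted above. The helper names group_by_inode and disk_usage_mb are made up for illustration, and it assumes that entries with the same inode are hard links on one filesystem.

```python
from collections import defaultdict

# Listing after dvc gc, copied from above: (path, size in MB, inode).
# The .dir files are left out because no size is given for them.
after_gc = [
    ("06/file1", 240, 60294316),
    ("19/file6", 500, 60949290),
    ("20/file3.dir.unpacked/data.json", 240, 60294316),
    ("20/file3.dir.unpacked/data_1.json", 65, 60294327),
    ("59/file4", 65, 60294327),
    ("dd/file5.dir.unpacked/prepared_data.npy", 360, 60949288),
]


def group_by_inode(entries):
    """Map each inode to the list of (path, size) entries that point at it."""
    groups = defaultdict(list)
    for path, size, inode in entries:
        groups[inode].append((path, size))
    return dict(groups)


def disk_usage_mb(entries):
    """Sum the sizes, counting every inode only once (hard links share storage)."""
    size_by_inode = {}
    for _path, size, inode in entries:
        size_by_inode[inode] = size
    return sum(size_by_inode.values())


groups = group_by_inode(after_gc)
for inode, members in groups.items():
    print(inode, [path for path, _size in members])
print(disk_usage_mb(after_gc), "MB in use")  # 240 + 500 + 65 + 360 = 1165
```

In this reading, inode 60949288 has a single remaining entry, the .unpacked copy of the old prepared_data.npy. So removing 14/file2 did not free the 360M of the expe1 output.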