My setup is DVC 0.75.0, installed from the .deb package on Ubuntu 18.04.
I also tested on Windows 10 with the .exe package.
I tried different configurations for the cache type (default, copy and symlink) and got similar behavior each time.
I’m testing DVC and trying to understand how it manages the datasets in its cache.
I initialize an empty repository with dvc init and add data with dvc add data, where data is a directory that contains two datasets:
- data/data.json of size 240M
- data/data_1.json of size 65M
I then run a script that produces an output dataset: dvc run -f prepare_data.dvc -d src/prepare_data.py data. It creates: output/prepared_data.npy of size 360M.
The .dvc/cache directory now contains the following files (names are simplified):
- 06/file1 of size 240M at inode 60294316
- 14/file2 of size 360M at inode 60949288
- 20/file3.dir at inode 60294287
- 59/file4 of size 65M at inode 60294327
- dd/file5.dir at inode 60949289
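For anyone who wants to reproduce this kind of listing, here is a minimal sketch of how it can be collected. It only walks a directory and reads file metadata; the function name list_cache is my own, and the .dvc/cache path is simply the cache location used in this post.

```python
import os


def list_cache(cache_dir):
    """Return (relative_path, size_in_bytes, inode) for every file under cache_dir."""
    entries = []
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            # lstat reports the entry itself rather than following a symlink
            st = os.lstat(path)
            entries.append((os.path.relpath(path, cache_dir), st.st_size, st.st_ino))
    return sorted(entries)


if __name__ == "__main__":
    # Prints nothing if the directory does not exist
    for rel, size, inode in list_cache(".dvc/cache"):
        print(f"{rel} of size {size} bytes at inode {inode}")
```

Sizes are reported in bytes here, whereas the listings in this post are rounded to megabytes.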
I commit to git and tag it as expe1. I now modify my script and run dvc repro prepare_data.dvc. It produces a new file: output/prepared_data.npy of size 500M. The .dvc/cache directory now contains the following files:
- 06/file1 of size 240M at inode 60294316
- 14/file2 of size 360M at inode 60949288
- 19/file6 of size 500M at inode 60949290
- 20/file3.dir at inode 60294287
- 20/file3.dir.unpacked/data.json of size 240M at inode 60294316
- 20/file3.dir.unpacked/data_1.json of size 65M at inode 60294327
- 22/file7.dir at inode 60294291
- 59/file4 of size 65M at inode 60294327
- dd/file5.dir at inode 60949289
- dd/file5.dir.unpacked/prepared_data.npy of size 360M at inode 60949288
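Several entries in this listing share an inode (for example 06/file1 and 20/file3.dir.unpacked/data.json), which suggests they are the same file on disk under two names. A rough sketch for spotting such pairs is below; shared_inodes is a hypothetical helper name, and it assumes the whole cache lives on a single filesystem, since inode numbers are only unique per filesystem.

```python
import os
from collections import defaultdict


def shared_inodes(cache_dir):
    """Map inode -> sorted relative paths, keeping only inodes reached by 2+ paths."""
    by_inode = defaultdict(list)
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, cache_dir)
            by_inode[os.lstat(path).st_ino].append(rel)
    # Keep only the inodes that have more than one name in the cache
    return {ino: sorted(paths) for ino, paths in by_inode.items() if len(paths) > 1}


if __name__ == "__main__":
    for inode, paths in shared_inodes(".dvc/cache").items():
        print(inode, paths)
```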
I commit and tag it as expe2. I now want to clean the cache to remove previous outputs from expe1 and run dvc gc. The .dvc/cache directory now contains the following files:
- 06/file1 of size 240M at inode 60294316
- empty directory 14
- 19/file6 of size 500M at inode 60949290
- 20/file3.dir at inode 60294287
- 20/file3.dir.unpacked/data.json of size 240M at inode 60294316
- 20/file3.dir.unpacked/data_1.json of size 65M at inode 60294327
- 22/file7.dir at inode 60294291
- 59/file4 of size 65M at inode 60294327
- dd/file5.dir at inode 60949289
- dd/file5.dir.unpacked/prepared_data.npy of size 360M at inode 60949288
Therefore, contrary to what I expected, the previous version of the output file is still present in the cache at dd/file5.dir.unpacked/prepared_data.npy. Is this the expected behavior? How can I properly clean the cache?
This file seems useless: if I try to check out expe1, it raises an error, and I have to reproduce the experiment anyway.
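To illustrate why the 360M may still occupy disk space after dvc gc: dd/file5.dir.unpacked/prepared_data.npy has the same inode as the removed 14/file2, so I assume the two were hard links to the same data (I have not confirmed how DVC creates the .unpacked entries). Under that assumption, removing one name leaves the data reachable through the other. The sketch below reproduces this with plain files in a temporary directory; it does not call DVC, and the file names are only borrowed from the listing above.

```python
import os
import tempfile


def remaining_after_unlink():
    """Create a file and a hard link to it, remove the original name,
    and report what is left behind under the second name."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "file2")
        linked = os.path.join(tmp, "prepared_data.npy")
        with open(original, "wb") as f:
            f.write(b"x" * 1024)
        os.link(original, linked)  # second name for the same inode
        same_inode = os.stat(original).st_ino == os.stat(linked).st_ino
        links_before = os.stat(linked).st_nlink
        os.remove(original)  # analogous to the cache entry being deleted
        links_after = os.stat(linked).st_nlink
        size_after = os.path.getsize(linked)
        return same_inode, links_before, links_after, size_after


if __name__ == "__main__":
    print(remaining_after_unlink())
```

If the assumption holds, the space would only be released once the remaining name under dd/file5.dir.unpacked is removed as well.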