Can the command `dvc add` just add new and changed files?

I have just started playing with dvc to investigate whether it can help me add version control to large sets of images used for deep learning.

I made my own small toy example to try to understand how dvc works.

I have a data folder with 10 images of approx. 10 MB each, ~100 MB in total.

When I run “dvc add data” I can see that the dvc cache grows by ~100 MB.

I now replace one of the images in my data folder with another one, i.e. remove one 10 MB image and add another 10 MB image. The total size is still 100 MB.

When I run “dvc add data” again to track the changes, the dvc cache grows by another 100 MB! I was thinking that dvc would automatically add only the difference between the data sets, but that does not seem to be the case. Does it copy all the content for each “dvc add” command?

If so, is there a convenient way to make dvc detect and store only the difference between versions of a data set?

Thanks,
Sven

Hi @Sven !

That is odd. How do you check the size of the cache dir? DVC should keep all the old files and add only the new file, so the cache should be 110 MB in total. That is a core feature for us.
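
To illustrate why only the new file should be added, here is a minimal sketch of a content-addressed store. It is a toy, not DVC's actual code: `store` is a hypothetical helper, and `cksum` is a weak stand-in for the content hash a real tool would use. Files are saved under a name derived from their content, so content that is already in the cache is skipped on the next "add".

```shell
# Toy content-addressed cache; an illustration only, not DVC's implementation.
work=$(mktemp -d)
mkdir "$work/data" "$work/cache"

# store SRC_DIR CACHE_DIR (hypothetical helper):
# copy each file into the cache under its checksum, skipping known content.
store() {
  for f in "$1"/*; do
    sum=$(cksum < "$f" | cut -d' ' -f1)   # stand-in for a real content hash
    [ -e "$2/$sum" ] || cp "$f" "$2/$sum"
  done
}

head -c 1024 /dev/urandom > "$work/data/a"
head -c 1024 /dev/urandom > "$work/data/b"
store "$work/data" "$work/cache"
first=$(ls "$work/cache" | wc -l | tr -d ' ')

# Replace one file, keep the other, and "add" again.
rm "$work/data/b"
head -c 1024 /dev/urandom > "$work/data/c"
store "$work/data" "$work/cache"
second=$(ls "$work/cache" | wc -l | tr -d ' ')

echo "cache objects after first add: $first, after second add: $second"
rm -rf "$work"
```

The second pass stores one new object rather than two, which mirrors the expected 100 MB to 110 MB growth in the question.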

Hi. Here’s a test I ran with three 5 MB files, which seems correct (no file duplication):

$ dvc version
DVC version: 0.71.0
Python version: 3.7.5
Platform: Darwin-19.0.0-x86_64-i386-64bit
Binary: False
Package: brew
Cache: reflink - True, hardlink - True, symlink - True

$ mkdir test && cd test
$ dvc init --no-scm
...
$ mkdir data
$ head -c 5242880 /dev/urandom > data/5mbfile1
$ head -c 5242880 /dev/urandom > data/5mbfile2
$ ls -lAFh
drwxr-xr-x   5 usr  staff   160B Dec  1 12:07 .dvc/
drwxr-xr-x   4 usr  staff   128B Dec  1 12:11 data/
$ ls -lAFh data
-rw-r--r--  1 usr  staff   5.0M Dec  1 12:11 5mbfile1
-rw-r--r--  1 usr  staff   5.0M Dec  1 12:11 5mbfile2
$ dvc add data
100% Add ...
$ du -sh .dvc/cache
 10M	.dvc/cache
$ rm -f data/5mbfile2
$ head -c 5242880 /dev/urandom > data/5mbfile3
$ ls -lAFh data
-rw-r--r--  1 usr  staff   5.0M Dec  1 12:11 5mbfile1
-rw-r--r--  1 usr  staff   5.0M Dec  1 12:11 5mbfile3
$ dvc status
data.dvc:
	changed outs:
		modified:           data
$ dvc add data
WARNING: Output 'data' of 'data.dvc' changed because it is 'modified'
100% Add ...
$ du -sh .dvc/cache
 15M	.dvc/cache

Maybe you can help us reproduce the problem with instructions such as these? Thanks
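
If it helps with reproducing, the before/after measurement can be wrapped in a small function. This is only a sketch: `growth_kb` is a hypothetical helper name, it relies on nothing beyond `du -sk`, and it reports whatever `du` sees, so it inherits `du`'s view of links.

```shell
# growth_kb DIR COMMAND [ARGS...] (hypothetical helper):
# print how many KB DIR grows by while COMMAND runs.
growth_kb() {
  dir=$1
  shift
  before=$(du -sk "$dir" | cut -f1)
  "$@" >&2                      # keep the command's own output off stdout
  after=$(du -sk "$dir" | cut -f1)
  echo $((after - before))
}
```

Inside a DVC project this could be called as `growth_kb .dvc/cache dvc add data`, once for the first add and once after swapping a file.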


Thanks for your replies and help!

Having tested the commands supplied by jorgeorpinel, I see that “du -sh” does not report the same folder size as right-clicking on it and selecting “Properties” in Windows Explorer.

If I take “du -sh” as the truth, everything works as expected. I guess that Windows Explorer does not compute the size on disk correctly?


We do indeed trust du -sh a lot 🙂 Not sure how that is implemented in Windows Explorer. One of our team members also noted that it might be caused by Explorer not quite understanding the hardlinks we use as an optimisation and counting them twice. Sounds reasonable to me as well 🙂
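
The double-counting idea can be sketched without DVC at all. The example below assumes a filesystem that supports hardlinks. The `cache` and `data` directory names are made up for illustration. It compares a naive per-path size sum with what `du` reports. Whether Windows Explorer really sums sizes this way is a guess, not something confirmed here.

```shell
# One blob, reachable through two hardlinked paths (names are illustrative).
work=$(mktemp -d)
mkdir "$work/cache" "$work/data"
head -c 1048576 /dev/urandom > "$work/cache/blob"
ln "$work/cache/blob" "$work/data/image"    # hardlink, no extra data written

# Link count of the blob (second column of ls -l).
links=$(ls -l "$work/cache/blob" | awk '{print $2}')

# Naive total: add up the size seen through every path.
naive=$(( $(wc -c < "$work/cache/blob") + $(wc -c < "$work/data/image") ))

# du counts each hardlinked file once within a single run.
du_kb=$(du -sk "$work" | cut -f1)

echo "link count: $links"
echo "naive per-path sum: $((naive / 1024)) KB"
echo "du -sk: $du_kb KB"
rm -rf "$work"
```

The naive sum comes out at roughly twice what `du` reports, which would be consistent with a size tool that counts every path separately.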
