I have just started playing with dvc to investigate whether it can help me add version control to large sets of images used for deep learning.
I made my own small toy example to try to understand how dvc works.
I have a data folder with 10 images of approx 10 MB each, ~100 MB in total.
When I run “dvc add data” I can see that the dvc cache grows by ~100 MB.
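For reference, the setup looked roughly like this (a minimal sketch; the file names and random contents are just placeholders standing in for my real images, and I'm assuming the default cache location `.dvc/cache`):

```
# toy example: track a folder of images with DVC
git init
dvc init

mkdir -p data
# stand-ins for my 10 images of ~10 MB each
for i in $(seq 1 10); do
  dd if=/dev/urandom of=data/img_$i.bin bs=1M count=10
done

dvc add data        # creates data.dvc and copies the content into the cache
du -sh .dvc/cache   # default cache location; ~100 MB at this point
```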
I now replace one of the images in my data folder with another one, i.e. remove one 10 MB image and add a different 10 MB image. The total size is still ~100 MB.
When I run “dvc add data” again to track the change, the dvc cache grows by another 100 MB! I was expecting dvc to add only the difference between the data sets, but that does not seem to be the case? Does it copy all the content for each “dvc add” command?
If so, is there any practical way to make dvc detect and store only the difference between versions of a data set?
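Concretely, the second step looked roughly like this (again just a sketch with placeholder file names):

```
# remove one image and add a different one of the same size
rm data/img_3.bin
dd if=/dev/urandom of=data/img_11.bin bs=1M count=10

dvc add data        # re-track the directory
du -sh .dvc/cache   # this is where I saw the cache grow by another ~100 MB
```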
That is odd. How do you check the size of the cache dir? DVC should store all of the old files plus the new file, so it should be ~110 MB total in the cache; that is a core feature for us.
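Something along these lines should give a reliable picture of what is actually stored (assuming the default `.dvc/cache` location):

```
du -sh .dvc/cache                                    # total cache size according to du
find .dvc/cache -type f | wc -l                      # number of cached blobs
find .dvc/cache -type f -exec du -h {} + | sort -h   # size of each blob
```

For a tracked directory you should see one blob per unique file plus a couple of tiny `.dir` entries describing the directory listings, so in your case roughly 11 image blobs (~110 MB) after the replacement.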
Having tested the commands supplied by jorgeorpinel, I see that “du -sh” does not report the same folder size as right-clicking the folder and selecting “Properties” in Windows Explorer.
If I take “du -sh” as the truth, everything works as expected. I guess that Windows does not compute the size on disk correctly?
We do indeed trust du -sh a lot. Not sure how that is implemented in Windows Explorer. A team member also noted that it might be caused by Explorer not quite understanding the hardlinks we use as an optimisation and counting them twice. Sounds reasonable to me as well.
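If you want to double-check the hardlink theory, something like this should work (a sketch assuming hardlinks are in use as the cache link type and the default `.dvc/cache` path; run it from Git Bash or another POSIX shell):

```
# a workspace file and its cache entry should share the same inode,
# so its link count is 2 and Explorer may count those bytes twice
stat -c '%h %i %n' data/*                   # link count, inode, name (GNU stat)
find .dvc/cache -type f -links +1 | wc -l   # cached blobs that have extra links
```

On Windows itself, `fsutil hardlink list <path>` lists the other names pointing at the same file, which is another quick way to confirm it.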