I have just started playing with dvc to investigate whether it can help me add version control to large sets of images used for deep learning.
I made my own small toy example to try to understand how dvc works.
I have a data folder with 10 images of approx. 10 MB each, ~100 MB in total.
When I run “dvc add data” I can see that the dvc cache grows by ~100 MB.
I now replace one of the images in my data folder with another one, i.e. I remove one 10 MB image and add another 10 MB image. The total size is still ~100 MB.
When I run “dvc add data” again to track the changes, the dvc cache grows by another 100 MB! I was expecting dvc to store only the difference between the two versions of the data set, but that does not seem to be the case. Does it copy all the content on each “dvc add” command?
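For reference, this is roughly how the cache growth can be measured. It is only a minimal sketch: it assumes the cache lives in the default `.dvc/cache` directory inside the project and that the script is run from the project root. The helper names `dir_size_bytes` and `size_mb` are my own and are not part of dvc.

```python
import os


def dir_size_bytes(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so linked files are not counted as extra storage.
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def size_mb(path):
    """Return the size of a directory tree in megabytes (10**6 bytes)."""
    return dir_size_bytes(path) / 1e6


if __name__ == "__main__":
    # Assumed default locations; adjust if the cache dir was reconfigured.
    print(f"data folder: {size_mb('data'):.1f} MB")
    print(f"dvc cache:   {size_mb('.dvc/cache'):.1f} MB")
```

Running it before and after each “dvc add data” shows how much the cache grew in that step.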
If so, is there a practical way to make dvc detect and store only the difference between versions of a data set?