Cache duplication for external dataset

Hello, I have a question about DVC cache duplication when I have an external dataset.

There are two cases I’ve tried to figure out, depending on the location of the cache and the dataset:

(Assume there are two datasets to version control, i.e., dataset 1 & dataset 2. Each dataset is 10M in size, and there is a 5M duplication (overlap) between dataset 1 & dataset 2.)
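For reference, a pair of datasets like this can be created with something like the following (file names are only illustrative; DVC caches files by content hash, so the identical shared file should be stored only once):

    # two 10M datasets that share one identical 5M file
    mkdir -p dataset1 dataset2
    head -c 5M /dev/urandom > dataset1/unique1.bin
    head -c 5M /dev/urandom > dataset2/unique2.bin
    head -c 5M /dev/urandom > shared.bin
    cp shared.bin dataset1/shared.bin
    cp shared.bin dataset2/shared.bin
    rm shared.bin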

[Case 1] Cache and both datasets are in the same storage as the DVC workspace.
In this case, after I dvc add dataset1 and then dvc add dataset1 + dataset2, the size of the cache dir is 15M (10M + 5M). That seems OK.
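What I ran is roughly this (directory names as in the example above):

    dvc add dataset1
    du -hs .dvc/cache    # ~10M
    dvc add dataset2
    du -hs .dvc/cache    # ~15M: the 5M overlap is stored only once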

[Case 2] The datasets are in storage external to the DVC workspace. (The cache is either inside or outside the storage where the DVC workspace is.)
In this case, after I dvc add dataset1 and then dvc add dataset1 + dataset2, the size of the cache dir is 25M (10M + 15M). So the feature DVC provides for eliminating cache duplication doesn’t seem to work for external datasets.
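The external variant is roughly this (the mount point /mnt/external is just an example):

    dvc add /mnt/external/dataset1 -o dataset1
    du -hs .dvc/cache    # ~10M
    dvc add /mnt/external/dataset2 -o dataset2
    du -hs .dvc/cache    # ~25M: the 5M overlap ends up stored twice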

I checked the size of the folders using du -hs in my Ubuntu environment.
Can I use that DVC feature to eliminate cache duplication when I have an external dataset?

Thank you.

Hello, @schakal!
What format is your dataset in? Is it a directory with files?
What is your dvc version?

Hello, @Paffciu
Yes, each dataset folder is structured as follows. For instance:

dataset1/
    cats/  (image files, .jpg)
    dogs/  (image files, .jpg)

And my environment is:

DVC version : 2.8.1 (pip)
Python 3.6.9 on Linux-5.4.0-81-generic-x86_64-with-Ubuntu-18.04-bionic

Thank you.

Thank you. Is the dataset that you are using a public one, so that I could try to reproduce it?

Sure. I just used the cats&dogs dataset from the tutorial at https://dvc.org/doc/use-cases/versioning-data-and-model-files/tutorial.

I downloaded it using

dvc get https://github.com/iterative/dataset-registry \
          tutorials/versioning/data.zip

and from the first 500 images of each class (i.e., cats 1~500.jpg and dogs 1~500.jpg) I split them into two sets (see the sketch after this list):
dataset1: cats 1~300.jpg & dogs 1~300.jpg
dataset2: cats 201~500.jpg & dogs 201~500.jpg
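Roughly, the split looked like this (file names are illustrative; adjust them to the actual layout after unzipping data.zip):

    unzip -q data.zip
    mkdir -p dataset1/cats dataset1/dogs dataset2/cats dataset2/dogs
    for i in $(seq 1 300); do
        cp data/cats/$i.jpg dataset1/cats/
        cp data/dogs/$i.jpg dataset1/dogs/
    done
    for i in $(seq 201 500); do
        cp data/cats/$i.jpg dataset2/cats/
        cp data/dogs/$i.jpg dataset2/dogs/
    done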

And I placed them either in the same storage as the workspace or outside of it, and ran the experiment.

So the exact folder sizes are not the same as in the case I described above, but the overall situation is the same; the numbers above were just a simplified example.

Thank you.

I just reran the whole thing, and it seems that if the dataset is outside of the workspace and I use “dvc add -o”, then the cache doesn’t account for the duplication between the datasets.

That is, even though dataset1 and dataset2 have a 5M duplication, the size of the cache is 29M, not 25M, which is unreasonable.

That is indeed unreasonable; let me try to reproduce it.

@schakal
I cannot reproduce it; for me the behavior is as expected: after adding the second dataset, the cache size is (size_1 + size_2), and not (2 * size_1 + size_2).
Could you provide me with some more info about Case 2?

The datasets are in storage external to the DVC workspace.

Does the data reside in a separate folder, or, for example, on an external drive?

The cache is either inside or outside the storage where the DVC workspace is.

I checked both the “default” cache (.dvc/cache) and an external cache dir (a cache outside of the DVC repo).

In both cases I get consistent results; the cache size grows as expected.
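For reference, the two setups I checked were roughly (the external cache path is just an example):

    # default cache location
    du -hs .dvc/cache
    # external cache dir, set before re-running the adds
    dvc cache dir /tmp/dvc_cache
    du -hs /tmp/dvc_cache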

@Paffciu
Ah, thank you for checking. Let me explain how I executed it and the results I got.

*data_v1: cats&dogs dataset (1~300.jpg for each class) - 15M
*data_v2: cats&dogs dataset (201~500.jpg for each class) - 15M
Therefore, there is a 5M duplication between data_v1 and data_v2 (201~300.jpg).

*workspace (denoted as WS below) path: /home/work/dvc_expm/

Our NAS is mounted on /home/work/nas/.
So I set the dataset path to /home/work/nas/outer_data (denoted as OD below) and
the remote path to /home/work/nas/outer_remote (denoted as REM below).

The size of every folder was measured with du -hs.

To reproduce the results, I executed as follows:
(1) In the WS, I initialized the repo and configured the default remote (REM).
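Roughly (the remote name rem is just an example):

    git init
    dvc init
    dvc remote add -d rem /home/work/nas/outer_remote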

(2) Copy “data_v1” to the OD using cp -r.

(3) In the WS, I added the external copy of data_v1 and pushed it.
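Roughly, with the -o flag I mentioned above (the output name is just an example):

    dvc add /home/work/nas/outer_data/data_v1 -o data_v1
    dvc push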

Then, the size of the REM is 15M,
and the size of /home/work/dvc_expm/.dvc/cache is 16M (since I didn’t set a cache dir here).
(Everything is OK here.)

(4) Copy “data_v2” to the OD using cp -r. Now the size of OD is 30M. (data_v1 & data_v2 are separate folders.)

(5) In the WS, I added the external copy of data_v2 and pushed it, in the same way.
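Again, roughly:

    dvc add /home/work/nas/outer_data/data_v2 -o data_v2
    dvc push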

Now the size of the related folders is as follows:

REM: 24M (the duplication is eliminated, as it should be)
/home/work/dvc_expm/.dvc/cache: 44M

So it seems that, in my case, only the remote eliminates the duplication; the local cache should also be around 24M, not 44M.
And the phenomenon of the overlapping cache (44M) is the same when I set a cache dir on my NAS (e.g., dvc cache dir /home/work/nas/outer_cache) in the first place.

Do you mean you can’t reproduce this scenario?

Thank you.

@Paffciu Could I ask whether the attempt to reproduce this is still in progress?