Does DVC duplicate a dataset in storage blobs when the same data is used in different tags?

Hi,

Suppose I have a dataset of 5 GB (images and JSONs). I version this dataset using DVC, tag it dataset_1, and push it to S3.

Now I add another 7 GB of data (images and JSONs) to the old dataset, create a tag called dataset_2, and push to the same S3 bucket. So the dataset_2 tag contains the old 5 GB plus the new 7 GB of data.

How much storage space is taken up in S3? Will it be 12 GB (5 + 7) or 17 GB (5 + (5 + 7))?
In other words, is the older dataset being duplicated in S3?
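For concreteness, this is roughly the workflow I have in mind (the directory name, remote, and tag names are just placeholders):

```
# version 1: 5 GB of images and JSONs in data/
dvc add data/
git add data.dvc .gitignore
git commit -m "Add initial 5 GB dataset"
git tag dataset_1
dvc push                      # uploads to the S3 remote

# version 2: add 7 GB of new files to the same directory
cp -r /path/to/new_batch/* data/
dvc add data/
git commit -am "Add 7 GB of new data"
git tag dataset_2
dvc push                      # how much does this add to the bucket?
```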

Hello @subhankar,
The answer to your question depends on how your datasets are stored. DVC works at file-level granularity, which means that if a given file shows up in several dataset versions, it is still stored as a single object in the remote. You can read a bit about our cache structure here: https://dvc.org/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory
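Roughly speaking, both the local cache and the remote are content-addressed: every file is stored under its own hash, and a directory version is just a small `.dir` listing that points to those hashes. A simplified sketch (hashes shortened; the exact layout differs between DVC versions, e.g. 3.x nests everything under `files/md5/`):

```
.dvc/cache/
├── 3f/
│   └── 4e2a...          # one image blob, stored once
├── a1/
│   └── 09bc...          # one JSON blob, stored once
└── d4/
    └── 55aa....dir      # listing of data/ for one version (tiny JSON file)
```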

So if your dataset is a directory of individual files, your data will take (5 + 7) GB, assuming none of the files repeat between the two versions. If some do, it's even better, because DVC knows to store only one copy of each.
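You can see this when you push the second version: DVC compares hashes with the remote and only uploads the ones that are missing, plus an updated `.dir` listing (a few KB). A quick way to check, assuming your remote is already configured:

```
git checkout dataset_2
dvc status --cloud    # shows which hashes are missing from the remote
dvc push              # transfers only the new ~7 GB of files + the new .dir entry
```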

If your dataset is some kind of archive (for example a .tar file), DVC treats the whole archive as a single file, so each tagged version of it is stored in full and your remote will hold 5 + (5 + 7) = 17 GB.
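The reason is that DVC hashes whole files: adding anything to the archive produces a different hash, so the remote keeps both versions of the archive side by side. A tiny illustration (the file name is hypothetical):

```
# dataset_1: data.tar holds the first 5 GB  -> one 5 GB blob in the remote
# dataset_2: data.tar holds all 12 GB       -> a second, 12 GB blob
md5sum data.tar   # different archive contents => different hash,
                  # so nothing inside the archive can be deduplicated
```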
