Does DVC duplicate a dataset in storage blobs when the same data is used in different tags?

Hi,

Suppose I have a dataset of 5 GB (images and JSONs). I version this dataset using DVC, tag it dataset_1, and push it to S3.

Now I add another 7 GB of data (images and JSONs) to the old dataset, create a tag called dataset_2, and push to the same S3 bucket. So the dataset_2 tag contains the old 5 GB plus the new 7 GB of data.

How much storage space is taken up in S3? Will it be 12 GB (5 + 7) or 17 GB (5 + (5 + 7))?
In other words, is the older dataset being duplicated in S3?
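For concreteness, this is roughly the workflow I have in mind (the directory name, remote, and tag names are just placeholders):

```
# version 1: 5 GB of images and JSONs in data/
dvc add data/
git add data.dvc .gitignore
git commit -m "Add initial 5 GB dataset"
git tag dataset_1
dvc push                      # uploads to the S3 remote

# version 2: add 7 GB of new files to the same directory
cp -r /path/to/new_batch/* data/
dvc add data/
git commit -am "Add 7 GB of new data"
git tag dataset_2
dvc push                      # how much does this add to the bucket?
```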

Hello @subhankar,
The answer to your question depends on how your datasets are stored. DVC works at file-level granularity, which means that if a given file shows up in several dataset versions, it is still stored as a single object in the remote. You can read a bit about our cache structure here: https://dvc.org/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory
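Roughly speaking, both the local cache and the remote are content-addressed: every file is stored under its own hash, and a directory version is just a small `.dir` listing that points to those hashes. A simplified sketch (hashes shortened; the exact layout differs between DVC versions, e.g. 3.x nests everything under `files/md5/`):

```
.dvc/cache/
├── 3f/
│   └── 4e2a...          # one image blob, stored once
├── a1/
│   └── 09bc...          # one JSON blob, stored once
└── d4/
    └── 55aa....dir      # listing of data/ for one version (tiny JSON file)
```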

So if your dataset is a directory of individual files, your data will take (5 + 7) GB, assuming none of the files repeat between the two versions. If some do, it's even better, because DVC knows to store only one copy of each.
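You can see this when you push the second version: DVC compares hashes with the remote and only uploads the ones that are missing, plus an updated `.dir` listing (a few KB). A quick way to check, assuming your remote is already configured:

```
git checkout dataset_2
dvc status --cloud    # shows which hashes are missing from the remote
dvc push              # transfers only the new ~7 GB of files + the new .dir entry
```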

If your dataset is some kind of archive (for example a .tar file), DVC treats the whole archive as a single file, so each tagged version of it is stored in full and your remote will hold 5 + (5 + 7) = 17 GB.
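The reason is that DVC hashes whole files: adding anything to the archive produces a different hash, so the remote keeps both versions of the archive side by side. A tiny illustration (the file name is hypothetical):

```
# dataset_1: data.tar holds the first 5 GB  -> one 5 GB blob in the remote
# dataset_2: data.tar holds all 12 GB       -> a second, 12 GB blob
md5sum data.tar   # different archive contents => different hash,
                  # so nothing inside the archive can be deduplicated
```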
