Does DVC duplicate dataset in storage blobs when the same dataset is used in different tags?


Suppose I have a 5 GB dataset (images and JSON files). I version this dataset with DVC, give it a tag named dataset_1, and push it to S3.

Now I add a new 7 GB dataset (images and JSON files) to the old dataset, create a tag called dataset_2, and push to the same S3 bucket. So the dataset_2 tag contains the old 5 GB plus the new 7 GB of data.

How much storage space is taken up in S3? Will it be 12 GB (5 + 7) or 17 GB (5 + (5 + 7))?
In other words, is the older dataset being duplicated in S3?

Hello @subhankar,
The answer to your question depends on how your datasets are stored. DVC tracks data at file-level granularity, which means that if a given file shows up in different datasets, it is still stored as a single file in the remote. You can read a bit about our cache structure here:

So if your datasets are directories containing files, your data will take (5 + 7) GB, assuming that none of the files repeat between the two versions (if they do, it's even better: DVC will know to store only one copy).
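The deduplication can be sketched as a toy content-addressed store. This is not DVC's API, just an illustration of the cache layout idea (MD5 digest, first two hex characters as a subdirectory); the helper name `push_to_cache` is made up:

```python
import hashlib
import os
import tempfile

def push_to_cache(cache_dir, files):
    """Store each file's bytes under its content hash, DVC-cache style.
    Files with identical content map to the same cache entry."""
    for data in files:
        digest = hashlib.md5(data).hexdigest()
        # Toy DVC-like layout: first two hex chars form a subdirectory
        path = os.path.join(cache_dir, digest[:2], digest[2:])
        os.makedirs(os.path.dirname(path), exist_ok=True)
        if not os.path.exists(path):  # dedup: skip if already cached
            with open(path, "wb") as f:
                f.write(data)

cache = tempfile.mkdtemp()
old = [b"image-001", b"image-002"]   # "dataset_1": two files
new = old + [b"image-003"]           # "dataset_2": old files plus one new

push_to_cache(cache, old)  # push dataset_1
push_to_cache(cache, new)  # push dataset_2

stored = sum(len(names) for _, _, names in os.walk(cache))
print(stored)  # 3 -- the two shared files are stored only once
```

Even though five file pushes happen in total, only three distinct objects end up in the cache, which is why the two tagged directory versions cost 5 + 7 GB rather than 5 + 12 GB.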

If your dataset is some kind of archive (for example, a .tar file), DVC will treat the whole archive as a single file and store each version of it separately, which means your remote will hold 5 + (5 + 7) = 17 GB.
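A quick sketch of why archives defeat the deduplication: adding one member changes the archive's bytes, so the whole-archive hash differs and a content-addressed cache keeps both full versions (again a toy illustration, not DVC's API):

```python
import hashlib
import io
import tarfile

def tar_bytes(members):
    """Build an in-memory .tar archive from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

v1 = tar_bytes({"a.jpg": b"old data"})                        # dataset_1
v2 = tar_bytes({"a.jpg": b"old data", "b.jpg": b"new data"})  # dataset_2

# Different archive bytes -> different hashes -> two full cache
# entries, even though "a.jpg" is identical inside both archives.
print(hashlib.md5(v1).hexdigest() != hashlib.md5(v2).hexdigest())  # True
```

Keeping the dataset as a plain directory lets DVC hash the individual files instead of one opaque blob, which is what makes the shared 5 GB reusable across tags.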
