Trying to understand data storage

Hi, I am a newcomer to DVC. I read through the tutorial for Data and Model Versioning, and I want to understand how data is stored.

Specifically, say I pull 1000 images from my remote storage and add 1000 more to create a new dataset of 2000 images. If I dvc add the new data folder and dvc push, how is it stored in the remote? Is it the original 1000 images plus the 1000 new ones, or the original 1000 images plus a separate folder with all 2000 images? The reason for the confusion is that at Get Started: Data Versioning (7:54), the video shows two folders in the remote Google Drive. I want to know this because I am an ML engineer at a computer vision company and we will have huge datasets, so we don’t want any duplicates across versions.

Furthermore, if I do a dvc checkout to a previous version (1000 images) of the dataset locally, do 1000 images just disappear from my newer dataset (2000 images)? I see this happen in the above video, where dvc checkout with a previous data/data.xml.dvc causes the data to be 36M instead of 72M. How exactly does dvc do this switch so quickly, and does it work for a folder of images as well as just a single xml file?

Sorry if these are stupid questions, I am completely new to this.


Hello @lpk!

DVC saves the files in the cache/remote under their respective checksums.
If a file gets changed and we add it again, it will be added to the cache as a new file because its checksum has changed. So new files appear in the cache for the same dataset only when you add new data or modify existing files. In your use case we will store 2000 images altogether in the cache (assuming they are not duplicated).
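A rough sketch of what that looks like on disk after adding a folder of images (the hashes below are made up, and the exact layout can differ between DVC versions, e.g. newer releases nest everything under files/md5/):

    $ dvc add data            # hashes every file and moves/links it into the cache
    $ tree .dvc/cache
    .dvc/cache
    ├── 19
    │   └── 6a0f2b...         # one image, stored under its own md5
    ├── 3d
    │   └── 8e12c4...         # another image
    └── a3
        └── 04afb9...dir      # small JSON manifest listing the contents of data/

Each unique image is stored exactly once, no matter how many dataset versions reference it, so the original 1000 images are not duplicated when the dataset grows to 2000.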

Regarding data removal: when you switch to the previous version and do dvc checkout, DVC reads the corresponding checksum from the .dvc file, figures out which images were in the folder at that revision, and checks out only those. The rest of the images stay in .dvc/cache and/or the remote (if you pushed the data to the remote). When you switch back to the “new” version and do dvc checkout again, they will be copied from the cache back into your folder. If you take a look at .dvc/cache you will see that it probably holds around 72 MB.
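To illustrate, the .dvc file for a directory is tiny and only points at that manifest, so switching versions is cheap (the hash is made up and the exact fields vary a bit between DVC versions):

    $ cat data.dvc
    outs:
    - md5: a304afb96060aad90176268345e10355.dir
      path: data

    $ git checkout HEAD~1 -- data.dvc   # older .dvc file -> older manifest (1000 images)
    $ dvc checkout                      # workspace rebuilt from cache to match that manifest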

Take a look at https://dvc.org/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory for some more context.


Some additional links that might also help you understand the cache structure and how DVC is able to do checkouts quickly:

How exactly does dvc do this switch so quickly, and does it work for a folder of images as well as just a single xml file?

If symlinks, hardlinks, or reflinks are enabled and available, DVC does not copy data at all; it just manipulates those links to “check out” the version of the files that is needed.
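For example, you can tell DVC which link types it is allowed to use via the standard cache.type setting (assuming your filesystem supports them); just a sketch, see the config docs for the details:

    $ dvc config cache.type reflink,hardlink,symlink,copy   # preference order
    $ dvc checkout --relink                                  # re-apply the link strategy to the workspace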


Thank you so much for your help!

If it’s ok, I have a follow-up clarification.

We plan to store a massive dataset of computer vision images and to update it over time. You can imagine it will look something like 100,000 images at the start, with incremental updates of 10,000 images. I was wondering what the best way to do this is. Specifically, if I add the data in tranches of 10,000 images with sequential dvc adds and pushes, does it mean that when I want to download the entire dataset, say after 3 tranches, I need to dvc pull the original 100,000 images and each of the 3 following tranches individually? Is there a way for me to combine the references for multiple dvc pushes, i.e. to dvc pull the original and the 3 tranches that were pushed in 3 separate dvc pushes all together? I ask because one way to do it, I imagine, is to download the entire dataset and push it back as one dvc push, but that can take quite long, especially with large datasets.

Also, do you know how DVC handles naming conflicts? Say two different images have the same name, like 0.jpg. Their checksums are different, but if they are in the same remote repository, how are they stored?

Hi @lpk. There are certain optimizations you can do for that, but before I answer, may I ask for some details? It would help me better understand whether DVC is a good fit for your scenario.

How do you plan to consume / use this data? Do you need the whole dataset to be pulled when you train?

With DVC there are ways to add the initial 100K images right into the remote without downloading them first. Please read about this here.
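If I'm thinking of the right feature, this is the --to-remote option; please double-check the linked docs and the dvc add / dvc import-url help for your version, and treat the bucket path below as a placeholder:

    # stream data that already lives in some bucket straight into the DVC remote,
    # without ever materializing it on the local machine
    $ dvc import-url s3://my-bucket/initial-100k/ data/images --to-remote
    $ git add data/images.dvc .gitignore
    $ git commit -m "Track initial 100K images"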

A slightly more complex question is how to handle updates in this situation. This depends on the answer to the question I asked above.

  • If you plan to have a shared DVC cache, or at least a local cache on the machine that runs the updates, then the operation of adding 10K images should be more or less quick, and dvc push will send only those 10K (a rough setup sketch follows this list).
  • If there is no machine / NFS with a warm cache, then yes, the only way to update, unfortunately, is to pull first, add images, and push. It means that we need to run pull every time. Please upvote and comment on this issue to bump its priority.
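Sketching the first option (all paths are placeholders; cache.dir, cache.shared and cache.type are regular DVC config options):

    # one-time setup on the machine (or NFS mount) that performs dataset updates
    $ dvc cache dir /mnt/shared/dvc-cache
    $ dvc config cache.shared group       # let the whole team reuse the same cache
    $ dvc config cache.type symlink       # avoid copying large files into workspaces

    # each update then only hashes and uploads the new 10K images
    $ cp -r /incoming/batch-04/* data/
    $ dvc add data
    $ dvc push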

Coming back to the initial question, though. In my experience, you would usually also need a way to create datasets from all the available data. For example, pick images with a specific class, or a smaller number of images of each class, or images similar to another one. Usually you would have some information available about the objects (annotations, labels, etc.).

Is it the case for you?


In the remote storage they will have different names (based on md5 of their content). There will be no conflict in that sense.
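To illustrate (the hashes are made up): two files can share a name in the workspace as long as they are in different directories, and in the cache/remote they become two separate content-addressed objects, while the original file names are only recorded in the directory's .dir manifest:

    $ md5sum train/0.jpg val/0.jpg
    0c3d7a...  train/0.jpg
    9f1ab2...  val/0.jpg
    # stored roughly as .dvc/cache/0c/3d7a...  and  .dvc/cache/9f/1ab2...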

Hi @shcheklein,

Thank you for the reply.

Q: How do you plan to consume / use this data? Do you need the whole dataset to be pulled when you train?
A: As a first cut, we simply want to use DVC to keep track of dataset changes over time. When we use the dataset, most of the time we would simply use the latest full dataset. If we do some kind of sub-sampling, it would likely be done with a script on the pulled data, nothing to do with dvc. So my question is: say we have 20 updates to our dataset over time. Do we have to download, add the new data, dvc add, and then dvc push 20 times to get the results we want? I’ve read the GitHub issue, and it appears there is a way to do so: Mechanism to update a dataset w/o downloading it first · Issue #4657 · iterative/dvc · GitHub
But it’s not formalized yet. No worries, it is not a major need at the moment.

I’ve upvoted the issue.