Hello,
What is intended to be the maximum dataset size? At some point hashing a large dataset becomes infeasible, right?
There is no strict limit anywhere in DVC. It all depends on your setup and how long you are willing to wait for operations to finish.
What is the largest dataset you would consider practical to use DVC with?
We have users tracking terabytes of data and datasets with a million files in DVC. It mostly depends on how frequently your dataset changes. If it's largely static, waiting for operations to finish is usually acceptable, as it's a one-time cost.
DVC supports pulling partial datasets and operating (adding/removing files) on those partial datasets, which you can later push; see Modifying Large Datasets. So you don't have to hash the whole dataset. Even then, DVC caches file hashes, so a hash is not recomputed unless the file's mtime changes. A minimal sketch of that workflow is shown below.
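For example, assuming a directory `data/` is already tracked as a single dataset (`data.dvc`) and pushed to a remote, the partial workflow could look roughly like this (the paths are just placeholders):

```
# Pull only the part of the tracked directory you need
dvc pull data/labels

# Modify it: add, change, or delete files under data/labels
cp ~/new-labels/*.json data/labels/

# Re-add the dataset; only changed files get re-hashed thanks to the cache
dvc add data

# Upload only the new/changed objects to the remote
dvc push
```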
If the files are tiny but you have hundreds of thousands of them, then dvc push/pull will be slower, as transferring a large number of small files is slow.
Also note that most users keep their large datasets on shared storage (S3, NFS, etc.) and use smaller DVC repositories as dataset and model registries, importing the parts of the dataset they need from that shared storage. A sketch of such an import follows.
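For instance, assuming a hypothetical registry repository and dataset path, importing part of a dataset into the current project could look like this:

```
# Import a subset of a dataset from a registry repo into this project;
# the repo URL and paths here are placeholders
dvc import https://github.com/example/dataset-registry data/images/cats -o data/cats
```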
Ok, thanks for the detailed answer! Very interesting.