We have users tracking TBs of data and datasets with up to a million files in DVC. It depends on how frequently your dataset changes. If it’s mostly static, you can usually wait for the initial operations to finish, since it’s a one-time thing.
DVC supports pulling partial datasets and operating on that partial data (adding/removing files), which you can later push. See Modifying Large Datasets.
So you don’t have to hash the whole dataset. Even then, DVC caches file hashes, so they are not computed again (unless a file’s mtime changes).
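As a rough sketch of that partial workflow (assuming a recent DVC version, a directory tracked as `data/`, and a subfolder `data/labels/` made up here for illustration; the exact flow is in the Modifying Large Datasets guide):

```bash
# Pull only the part of the tracked directory you need
dvc pull data/labels/

# Modify/add/remove files inside that subset
cp ~/new-labels/* data/labels/

# Re-add the directory; unchanged files reuse cached hashes, so this is fast
dvc add data
dvc push
```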
If the files are tiny but you have hundreds of thousands of them, dvc push/pull
will be slower, since transferring a large number of small files is slow.
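If you run into this, one thing worth trying is raising the number of parallel transfer jobs (the default depends on the remote type; the number below is just a placeholder to tune for your remote and bandwidth):

```bash
# Transfer more files concurrently during push/pull
dvc push --jobs 16
dvc pull --jobs 16
```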
Also note that most users keep their large datasets in shared storage (S3, NFS, etc.) and use smaller DVC repositories as dataset and model registries, importing parts of the dataset from that shared storage.
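For example, a minimal sketch of the registry pattern (the repo URL, paths, and bucket name below are made up for illustration):

```bash
# Import a piece of a dataset tracked in a separate "registry" repository
dvc import https://github.com/example/dataset-registry images/cats -o data/cats

# Or download and track data directly from a shared storage location
dvc import-url s3://example-bucket/datasets/cats/ data/cats-raw
```

Both create a `.dvc` file that records where the data came from, so the smaller project repo stays lightweight while the bulk of the data lives in the shared storage or registry repo.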