Maximum data size

We have users tracking terabytes of data and datasets of up to a million files with DVC. The practical limit depends mostly on how frequently your dataset changes: if it's largely static, you can usually afford to wait for the initial operations to finish, since they are a one-time cost.

DVC supports pulling partial datasets, operating on them (adding or removing files), and later pushing only the changes. See Modifying Large Datasets.
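
A rough sketch of that workflow (the paths are illustrative; see the linked guide for the exact, up-to-date steps) looks like this: pull just one subdirectory of a tracked dataset, modify it, re-track the directory, and push only the new files.

```bash
# Pull only part of a dataset tracked as a single directory
dvc pull data/images/cats/

# Add or delete files inside the partially pulled directory,
# then re-track the directory and upload only the new objects
dvc add data/images
dvc push data/images
```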

So you don’t have to hash the whole dataset. And even when you do, DVC caches file hashes, so they aren’t computed again unless a file’s mtime changes.
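
For example (assuming a directory tracked with `dvc add`), the first run hashes every file, while later runs reuse the cached hashes:

```bash
# First run: every file is hashed (slow for large datasets)
dvc add data/

# Later runs reuse cached hashes as long as file mtimes and
# sizes are unchanged, so these are fast
dvc status
dvc add data/
```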

If the files are tiny but you have hundreds of thousands of them, `dvc push`/`dvc pull` will be slower, since transferring a large number of files is slow (per-file overhead adds up).
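
One knob that can help in this case is the number of parallel transfer jobs; the right value depends on your remote and network, so the `8` below is just an example:

```bash
# Increase parallel transfer jobs when pushing/pulling many small files
dvc push --jobs 8
dvc pull --jobs 8
```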

Also note that most users keep their large datasets in shared storage (S3, NFS, etc.) and use smaller DVC repositories as dataset and model registries, importing only the parts of the dataset they need from that shared storage.
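
A minimal sketch of that pattern (the registry repo URL, bucket, and paths below are hypothetical):

```bash
# Import part of a dataset from a DVC dataset-registry repository
dvc import https://github.com/example/dataset-registry \
    images/train -o data/train

# Or track data that lives directly in shared storage
dvc import-url s3://example-bucket/datasets/train data/train
```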