Hello,
What is intended to be the maximum dataset size? At some point hashing a large dataset becomes infeasible, right?
There is no strict limit anywhere in DVC. It all depends on your setup and how long you are willing to wait for operations to finish.
What is the largest dataset you would consider practical to use DVC with?
We have users tracking terabytes of data and datasets with a million files in DVC. It mostly depends on how frequently your dataset changes. If it's largely static, waiting for operations to finish is usually acceptable, as it's a one-time cost.
DVC supports pulling partial datasets and operating (adding/removing files) on those partial datasets, which you can later push; see Modifying Large Datasets. So you don't have to hash the whole dataset. Even then, DVC caches file hashes, so a hash is not recomputed unless the file's mtime changes. A minimal sketch of that workflow is shown below.
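For example, assuming a directory `data/` is already tracked as a single dataset (`data.dvc`) and pushed to a remote, the partial workflow could look roughly like this (the paths are just placeholders):

```
# Pull only the part of the tracked directory you need
dvc pull data/labels

# Modify it: add, change, or delete files under data/labels
cp ~/new-labels/*.json data/labels/

# Re-add the dataset; only changed files get re-hashed thanks to the cache
dvc add data

# Upload only the new/changed objects to the remote
dvc push
```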
If the files are tiny but you have hundreds of thousands of them, then dvc push/pull will be slower, as transferring a large number of small files is slow.
Also note that most users keep their large datasets on shared storage (S3, NFS, etc.) and use smaller DVC repositories as dataset and model registries, importing the parts of the dataset they need from that shared storage. A sketch of such an import follows.
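For instance, assuming a hypothetical registry repository and dataset path, importing part of a dataset into the current project could look like this:

```
# Import a subset of a dataset from a registry repo into this project;
# the repo URL and paths here are placeholders
dvc import https://github.com/example/dataset-registry data/images/cats -o data/cats
```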
Ok, thanks for the detailed answer! Very interesting.