@uzair
So, I’ve been playing around and it seems that, indeed, handling a lot of small files is still painful.
I ran my own tests on a 140 MB dataset with 70k files:

- DVC with the default jobs value (4 × CPU count): 1100 s
- plain `aws s3 cp --recursive`: 3000 s
Even though DVC spends a lot of time acquiring the lock in the first case, the transfer is still much faster than it would be with a reduced number of jobs.
So my suggestion is to play around with the number of jobs (`dvc push --jobs {X}`) - that might help to some extent. Regretfully, pushing a lot of small files is still painfully slow.
One workaround might be to pack the files into a single archive, but that comes at the cost of cache size whenever we update the dataset, since the whole archive gets a new hash and has to be stored again.
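To illustrate the packing idea, here is a minimal sketch using Python's stdlib `tarfile` (the function name and layout are just for illustration, not anything DVC provides). You would pack the directory once, track the single `.tar` with DVC, and push that instead of 70k individual objects:

```python
import os
import tarfile


def pack_dataset(src_dir: str, tar_path: str) -> int:
    """Pack every file under src_dir into one uncompressed tar.

    Returns the number of files packed. An uncompressed tar is
    deliberate: it keeps packing fast, and compression can be
    left to the remote/transport if desired.
    """
    count = 0
    with tarfile.open(tar_path, "w") as tar:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                # Store paths relative to src_dir so the archive
                # unpacks cleanly anywhere.
                tar.add(full, arcname=os.path.relpath(full, src_dir))
                count += 1
    return count
```

The trade-off mentioned above applies: changing even one file means re-hashing and re-uploading the whole archive, so this only pays off for datasets that change rarely or wholesale.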
@uzair please tell me if any of these solutions could help you. You can also chime in on the original issue about directory optimizations and share your problem: https://github.com/iterative/dvc/issues/1970. We might need to reconsider the current state of the optimizations and think about whether there is something more to be done.