Dealing with large numbers of small files

A project I’m working on has many small files, which don’t work well with DVC. Instead, I zip up all the files and push the archive to the remote. However, it would be nice to know whether I’ve modified any of the files inside the zipped folder and to diff them against the remote. Am I supposed to be using a pipeline to accomplish this? Is this just a silly goal to have?


Hi @Seanny123!

DVC does not currently provide any way to compare the contents of a zip file between your local copy and the remote. However, the next release of DVC will include several optimizations that substantially improve performance for handling many small files. Ideally, you should be able to push your data directly (without using an intermediate zip file), and run dvc status -c to see which files differ between your local copy and the remote.
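
For illustration, once you have upgraded, a direct workflow might look like the following (the data/ directory name here is just a placeholder):

```
# Track the directory of small files directly, without an intermediate zip
dvc add data

# Upload the cached files to the default remote
dvc push

# Later: compare the local cache against the remote
# (the -c/--cloud flag checks against the remote instead of the workspace)
dvc status -c
```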

It sounds like your data set has a large number of files, but you expect that only a small percentage of them will have been modified each time you push to the remote? Performance in this case will be significantly improved in the upcoming release, and I would recommend upgrading DVC once the new version is out to see whether the performance without an intermediate zip meets your needs.


Are these changes already in the master branch? Alternatively, could you please link me to the Pull Request where these changes are being prototyped?

These changes are available in the master branch.

If you are interested, https://github.com/iterative/dvc/pull/3634 also contains some benchmarks (using an S3 remote) for the new/optimized behavior.


Those changes work great for adding many small files. However, when pushing, the files are still uploaded individually, when it would be helpful to zip them together first. Is this “zip before pushing” functionality something I should be able to do with a pipeline?
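
For illustration, a pipeline stage along these lines could produce the zip as a tracked output. This is only a sketch: the directory and archive names are placeholders, it assumes a zip binary is on PATH, and depending on your DVC version the command may be dvc stage add rather than dvc run:

```
# Hypothetical stage that bundles the directory into a single archive.
# "data/" and "data.zip" are placeholder names.
dvc run -n zip_data \
        -d data \
        -o data.zip \
        "zip -r data.zip data"

# Pushing now transfers one large archive instead of many small files
dvc push
```

Note that this trades away DVC’s per-file deduplication: any change to the directory produces a new archive, so the whole bundle is re-uploaded each time.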

@Seanny123 that’s an existing feature request; please upvote it and feel free to comment with your case/support for it on GitHub: https://github.com/iterative/dvc/issues/1239 Thanks!

@Seanny123 could you elaborate on why you think zipping them together would work better in your case?

To my mind, treating them individually should be better for two reasons:

  1. We can upload in parallel in a very aggressive manner (the --jobs option; see the sketch after this list).
  2. DVC can upload only the delta (just the new or changed files) instead of the whole bundle every time.
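
As a concrete example of the first point, the number of parallel transfer jobs can be raised from the default (16 here is an arbitrary choice):

```
# Push with more parallel upload jobs than the default
dvc push --jobs 16
```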

I mostly just wanted to reduce the amount of time it took to upload a large set of files. Specifically, the whole folder is 20 GB. However, this might be a case where I’m abusing DVC. Additionally, I haven’t checked whether I’m using the --jobs option, so this might just be premature optimization on my end.

20 GB does not sound like abuse. But I would still expect pushing/pulling them as a directory to be faster. DVC does take some time to check the directory for changes before pushing/pulling, but overall it should still be a better experience, especially if you expect to add or remove only a small number of files in the directory each time.