Advice for versioning many many small files?

Hello! I’m new to DVC :slight_smile:
I'm working on versioning a large dataset (~140 GB total) that consists of very many small files. They are MP3 audio files (think training data for a speech model), each around 100-200 KB (only 5-15 seconds long on average), so of course I've got a lot of files.

I ran dvc push and let it run for ~16 hours before cancelling it; it seemed nowhere near completion, and this probably wasn't a great idea. I found a couple of links online saying DVC isn't ideal for many small files, but I also found a comment from a DVC maintainer on this forum (from around April 2020) saying recent changes should improve dvc push performance for many small files. Would anyone be able to advise me? Did I possibly do something wrong, or should I be rolling these files into a tar, zip, etc. and versioning that? Wouldn't it make more sense to version the individual data files? (I'm new to working with data too :upside_down_face:)

Help would be really appreciated! Thanks

Hi @uzair !

Could you show the output of dvc doctor, please?

Hey @kupruser, thanks for offering to help! Sorry for the late response, here's the output of dvc doctor:

```
DVC version: 1.9.1 (pip)

Platform: Python 3.8.5 on Linux-5.4.0-58-generic-x86_64-with-glibc2.29
Supports: http, https, s3, ssh
```

Thank you :slight_smile:

Hi @uzair!

What type of remote are you using? And did you check, by any chance, how much time transferring your data would take with another tool?

hi @Paffciu !

Honestly, I have not checked. However, according to a speed test I have 900 Mbps (112.5 MB/s) upload speed, so if I'm not mistaken the raw data alone should take roughly 20 minutes to transfer, although I understand there would probably be some cap when uploading to the remote (S3 in my case), overhead from establishing connections, etc.
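
Back-of-envelope, assuming my upload link were the only bottleneck (which it certainly isn't in practice):

```
# Rough estimate only: ignores per-file overhead, S3 request latency, etc.
echo "140000 / 112.5 / 60" | bc -l   # 140 GB at 112.5 MB/s ≈ 20.7 minutes
```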

I found this GitHub issue where @efiop (GitHub username, not sure who this is, but they seem to be a DVC maintainer) mentioned DVC isn't ideal for very many small files, because it has to keep re-establishing a connection to the remote for each file or something: dvc cache push is slow for many files · Issue #497 · iterative/dvc · GitHub

So should I be rolling my files into something like a tar file and versioning that, instead of the raw small audio files?

@uzair
That issue is from two years ago, and we have done a fair share of optimizations since then, so nowadays this should not be the case. In your case something seems to be wrong: 140 GB is not that much and does not justify 16 hours for a single command. I'll run a scaled-down test and get back to you.

@uzair
So, I've been playing around, and it seems that a lot of small files is indeed still painful.
I ran my own tests on a 140 MB dataset of 70k files:
DVC with the default --jobs value (4 * number of CPUs): ~1100 s
plain aws s3 cp --recursive: ~3000 s

Even though DVC spends a lot of time acquiring locks in the first case, the transfer is much faster than it would be if we reduced the number of jobs.
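
For reference, the comparison above boils down to roughly this (the bucket name and data path are placeholders):

```
# Roughly what was timed in the two cases above (bucket/path are placeholders):
time dvc push                                      # default --jobs is 4 * CPU count
time aws s3 cp --recursive data/ s3://some-bucket/data/
```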

So my suggestion is to play around with the number of jobs (dvc push --jobs {X}); that might help to some extent. Regretfully, it seems that a lot of small files is still painfully slow.
One option might be to pack the files into archives, but that comes at the cost of cache size when we decide to update the dataset. A rough sketch of both options is below.
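
Something along these lines (a minimal sketch only; the job count, shard names, and paths are made-up examples):

```
# 1) More parallel transfers (the default is 4 * number of CPU cores):
dvc push --jobs 64

# 2) Or pack the small audio files into a few large archives and track those instead
#    (shard names and paths are placeholders):
tar -cf speech_shard_000.tar data/audio/shard_000/
dvc add speech_shard_000.tar
dvc push
```

The downside of packing is exactly the cache-size cost mentioned above: DVC caches by content hash, so changing anything inside a shard produces a new archive, and the whole shard has to be cached and pushed again rather than only the files that changed.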

@uzair please tell me if any of these solutions could help you. You can also chime in on the original issue about directory optimizations and share your problem: https://github.com/iterative/dvc/issues/1970. We might need to reconsider the current state of the optimizations and think about whether there is more to be done.

I'm struggling with the same issue. Pulling a directory of many files is roughly an order of magnitude slower than pulling a zip of the same directory. The size of the zip is basically the same as the total size of the individual files, so there's not really any compression going on; it's just the overhead of handling the files one by one.

For the record, I replied in dvc: performance optimization for directories · Issue #1970 · iterative/dvc · GitHub