Dealing with large numbers of small files

A project I’m working on has many small files, which don’t work well with DVC. Instead, I zip up all the files and push the archive to the remote. However, it would be nice to know whether I’ve modified any of the files inside the zipped folder and to diff them against the remote. Am I supposed to be using a pipeline to accomplish this? Is this just a silly goal to have?


Hi @Seanny123!

DVC does not currently provide any way to compare the contents of a zip file between your local copy and the remote. However, the next release of DVC will include several optimizations that substantially improve performance for handling many small files. Ideally, you should be able to push your data directly (without using an intermediate zip file), and run dvc status -c to see which files differ between your local copy and the remote.
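
For illustration, once you have upgraded, a direct workflow might look like the following (the data/ directory name here is just a placeholder):

```
# Track the directory of small files directly, without an intermediate zip
dvc add data

# Upload the cached files to the default remote
dvc push

# Later: compare the local cache against the remote
# (the -c/--cloud flag checks against the remote instead of the workspace)
dvc status -c
```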

It sounds like your data set has a large number of files, but you expect that only a small percentage of them will have been modified each time you push to the remote? Performance in this case will be significantly improved in the upcoming release, and I would recommend upgrading DVC once the new version is out to see whether the performance without an intermediate zip meets your needs.


Are these changes already in the master branch? Alternatively, could you please link me to the Pull Request where these changes are being prototyped?

These changes are available in the master branch.

If you are interested, https://github.com/iterative/dvc/pull/3634 also contains some benchmarks (using an S3 remote) for the new/optimized behavior.


Those changes work great for adding many small files. However, when pushing, the files are still uploaded individually, when it would be helpful to zip them together first. Is this “zip before pushing” functionality something I should be able to do with a pipeline?
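
For illustration, a pipeline stage along these lines could produce the zip as a tracked output. This is only a sketch: the directory and archive names are placeholders, it assumes a zip binary is on PATH, and depending on your DVC version the command may be dvc stage add rather than dvc run:

```
# Hypothetical stage that bundles the directory into a single archive.
# "data/" and "data.zip" are placeholder names.
dvc run -n zip_data \
        -d data \
        -o data.zip \
        "zip -r data.zip data"

# Pushing now transfers one large archive instead of many small files
dvc push
```

Note that this trades away DVC’s per-file deduplication: any change to the directory produces a new archive, so the whole bundle is re-uploaded each time.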

@Seanny123 that’s an existing feature request; please upvote it and feel free to comment with your case/support for it on GitHub: https://github.com/iterative/dvc/issues/1239 Thanks!

@Seanny123 could you elaborate on why you think zipping them together would work better in your case?

To my mind, treating them individually should be better for two reasons:

  1. We can upload in parallel in a very aggressive manner (the --jobs option; see the sketch after this list).
  2. DVC can upload only the delta (just the new or changed files) instead of the whole bundle every time.
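
As a concrete example of the first point, the number of parallel transfer jobs can be raised from the default (16 here is an arbitrary choice):

```
# Push with more parallel upload jobs than the default
dvc push --jobs 16
```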

I mostly just wanted to reduce the amount of time it took to upload a large set of files. Specifically, the whole folder is 20 GB. However, this might be a case where I’m abusing DVC. Additionally, I haven’t checked whether I’m using the --jobs option, so this might just be premature optimization on my end.

20 GB does not sound like abuse. But I would still expect pushing/pulling them as a directory to be faster. DVC does take some time to check the directory for changes before pushing/pulling, but overall it should still be a better experience, especially if you expect to add or remove only a small number of files in the directory each time.