What's causing my pre-commit hook to write so much data?

I’ve got a dvc repo with ~300k small files (~70GB data) - I realise now this is a bad idea, so I’m in the process of trying to understand a bit more and optimise it without breaking anything!

There’s currently a pre-commit hook running that started about 12 hours ago, and from what I can see it’s in a pattern of starting a python subprocess, reading about 2GB from disk, then writing about 50GB, and repeating that same series of steps.

What I’m not clear on is why a status check would be writing a significant amount of data at all, and especially why it would write several times the overall size of the dataset in a single hook run. I’m concerned about causing issues if I interrupt it mid-write, but I’m also not sure whether this is an “it’ll finish some time tomorrow and then I can start optimising things” issue or an “it’s in a loop that’ll run until the heat death of the universe” issue.

Would very much appreciate any info that helps me get a better handle on what’s going on!

Hmm, this is weird tbh. I also don’t see a reason for it to write a large amount of data - 70GB doesn’t sound too large and shouldn’t be taking that long. Could you run dvc version and share the result?

Also, how many .dvc files or stages in dvc.yaml do you have?

Yeah, in terms of data size it’s nothing major - I’ve used dvc before with a few hundred 1-5GB files, so at least 10x more data (but 1000x fewer files), and it worked great.

It’s probably my approach here that’s the issue: I went with one .dvc file per tracked file (rather than one per directory), with the intention of getting a separate hash for every item so I could use them from my own code to automatically pinpoint unexpected changes to files that should be static. The total .dvc file count from find . -mindepth 1 -type f -name "*.dvc" -exec printf x \; | wc -c is 325,881.
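For context, the sort of check I was hoping to build on top of those per-file hashes is roughly the following (just a sketch of the idea, not something DVC itself provides - it assumes each .dvc file is plain YAML with an outs list whose entries carry md5 and path fields, which is what mine look like, and that each data file sits next to its .dvc file):

```python
# Sketch of the "pinpoint unexpected changes" idea, not what DVC runs itself.
# Assumes each .dvc file looks roughly like:
#   outs:
#   - md5: <hash>
#     path: <data file relative to the .dvc file>
import hashlib
import pathlib

import yaml  # pip install pyyaml


def file_md5(path: pathlib.Path) -> str:
    """MD5 of a file's contents, read in chunks to keep memory flat."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def find_changed(repo_root: str) -> list[pathlib.Path]:
    """Return data files whose current hash no longer matches the recorded one."""
    changed = []
    for dvc_file in pathlib.Path(repo_root).rglob("*.dvc"):
        meta = yaml.safe_load(dvc_file.read_text())
        for out in (meta or {}).get("outs", []):
            data_file = dvc_file.parent / out.get("path", "")
            if data_file.is_file() and file_md5(data_file) != out.get("md5"):
                changed.append(data_file)
    return changed


if __name__ == "__main__":
    for path in find_changed("."):
        print(f"unexpected change: {path}")
```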

The initial dvc add and dvc push took a couple of hours total, so I realised I was pushing my luck - the plan from there was to get it committed to git as well so there’s a known good state to roll back to, and then do some more reading of the docs to work out a better way to structure things. The pre-commit hook was still running when I got up this morning, so I monitored it for a bit and figured it was time to ask someone with a better understanding!

DVC version: 3.59.0 (brew)
--------------------------
Platform: Python 3.13.1 on macOS-15.2-arm64-arm-64bit-Mach-O
Subprojects:
	dvc_data = 3.16.7
	dvc_objects = 5.1.0
	dvc_render = 1.0.2
	dvc_task = 0.40.2
	scmrepo = 3.3.9
Supports:
	azure (adlfs = 2024.12.0, knack = 0.12.0, azure-identity = 1.19.0),
	gdrive (pydrive2 = 1.21.3),
	gs (gcsfs = 2024.12.0),
	hdfs (fsspec = 2024.12.0, pyarrow = 18.1.0),
	http (aiohttp = 3.11.11, aiohttp-retry = 2.9.1),
	https (aiohttp = 3.11.11, aiohttp-retry = 2.9.1),
	oss (ossfs = 2023.12.0),
	s3 (s3fs = 2024.12.0, boto3 = 1.35.93),
	ssh (sshfs = 2024.9.0),
	webdav (webdav4 = 0.10.0),
	webdavs (webdav4 = 0.10.0),
	webhdfs (fsspec = 2024.12.0)
Config:
	Global: /Users/moonbuggy/Library/Application Support/dvc
	System: /opt/homebrew/share/dvc

Just made a discovery: the .dvc/tmp/rwlock file is continually jumping from 0B to 131kB, and 131kB * 325,000 ≈ 43GB. So if the contents of that file are much larger than usual because it refers to every .dvc file in the repo(?), and the lock file gets wiped and rewritten each time a new .dvc file is processed, that would more or less explain the write volume.
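In case anyone wants to reproduce the observation, something along these lines is what I’d use to keep an eye on it (a rough lower-bound estimate only - it samples the file size once a second, so it misses rewrites that happen faster than that, and the rwlock path is just the one I’m seeing on my machine):

```python
# Rough estimate of write volume caused by the rwlock file being wiped and rewritten.
# Samples the file size once a second and adds up the bytes seen before each reset to 0,
# so it's a lower bound if the rewrite cycle is faster than the sampling interval.
import os
import time

RWLOCK = ".dvc/tmp/rwlock"  # path as seen in my repo


def watch(interval: float = 1.0) -> None:
    total_bytes = 0
    rewrites = 0
    last_size = 0
    try:
        while True:
            size = os.path.getsize(RWLOCK) if os.path.exists(RWLOCK) else 0
            if size < last_size:  # file was truncated/recreated since the last sample
                total_bytes += last_size
                rewrites += 1
            last_size = size
            print(f"rewrites={rewrites} estimated_written={total_bytes / 1e9:.2f} GB", end="\r")
            time.sleep(interval)
    except KeyboardInterrupt:
        print(f"\nrewrites={rewrites} estimated_written={total_bytes / 1e9:.2f} GB")


if __name__ == "__main__":
    watch()
```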

Not sure what that says for the bigger picture yet, though.

It was still going in the same pattern after 36 hours, so I just ctrl-c’d it and nothing unusual seems to have happened - all of the writes seem to have gone to the cache directory, and the data had already been pushed to the remote, so worst case that can be recreated. I’ve disabled the hook and committed all the .dvc files to git, so at least now there’s a known snapshot of the data in the dvc remote and of the hashes in git.

I’ve started a separate thread asking about the underlying goal I was trying to achieve with this setup - would be interested if you had any thoughts on that, @shcheklein?