What's causing my pre-commit hook to write so much data?

I’ve got a dvc repo with ~300k small files (~70GB data) - I realise now this is a bad idea, so I’m in the process of trying to understand a bit more and optimise it without breaking anything!

There’s currently a pre-commit hook running that started about 12 hours ago, and from what I can see it’s in a pattern of starting a python subprocess, reading about 2GB from disk, then writing about 50GB, and repeating that same series of steps.

What I’m not clear on is why a status check would be writing significant amounts of data, and especially why it would be writing multiple times the overall size of the dataset for a single hook? I’m concerned about causing issues if I interrupt it while writing, but I’m also not sure if this is a “it’ll finish some time tomorrow and then I can start optimising things” issue or a “it’s in a loop that’ll run until the heat death of the universe” issue.

Would very much appreciate any info that helps me get a better handle on what’s going on!

hmm, this is weird tbh. I also don’t see a reason for it to write large amounts of data. 70GB doesn’t sound too large and should not be taking that long. Could you run dvc version and share the result?

Also, how many .dvc files or stages in dvc.yaml do you have?

Yeah in terms of data size it’s nothing major - I’ve used dvc before with a few hundred 1GB - 5GB files, so at least 10x more data (but 1000x fewer files), and it’s been great previously.

It’s probably my approach here that’s the issue: I went with one .dvc file per tracked file (rather than one per directory), with the intention of getting a separate hash for every item so I could use the hashes from my own code to automatically pinpoint unexpected changes to files that should be static. The total .dvc file count from find . -mindepth 1 -type f -name "*.dvc" -exec printf x \; | wc -c is 325,881.
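
For context, this is roughly the kind of check I was hoping to build on top of those per-file hashes - just a sketch, assuming the usual single-output .dvc layout that dvc add writes (an outs list with md5 and path fields); the snapshot file name is made up for illustration:

```python
# Sketch of the "pinpoint unexpected changes" idea: read the md5 recorded in each
# .dvc file and diff it against a previously saved snapshot. Assumes single-output
# .dvc files as written by dvc add (an `outs` list with md5/path entries).
import json
from pathlib import Path

import yaml  # PyYAML


def collect_dvc_hashes(root: str = ".") -> dict[str, str]:
    """Map each tracked data path to the md5 recorded in its .dvc file."""
    hashes = {}
    for dvc_file in Path(root).rglob("*.dvc"):
        # skip DVC's internal .dvc/ directory and anything that isn't a regular file
        if not dvc_file.is_file() or ".dvc" in dvc_file.parts[:-1]:
            continue
        meta = yaml.safe_load(dvc_file.read_text()) or {}
        for out in meta.get("outs", []):
            # `path` is relative to the directory containing the .dvc file
            tracked = (dvc_file.parent / out["path"]).as_posix()
            hashes[tracked] = out["md5"]
    return hashes


def unexpected_changes(snapshot_file: str = "hash_snapshot.json") -> dict[str, str]:
    """Paths whose recorded hash differs from a saved snapshot (hypothetical file)."""
    current = collect_dvc_hashes()
    previous = json.loads(Path(snapshot_file).read_text())
    return {path: md5 for path, md5 in current.items() if previous.get(path) != md5}
```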

The initial dvc add and dvc push took a couple of hours total, so I realised I was pushing my luck - the plan from there was to get it committed to git as well so there’s a known good state to roll back to, and then do some more reading of the docs to work out a better way to structure things. The pre-commit hook was still running when I got up this morning, so I monitored it for a bit and figured it was time to ask someone with a better understanding!

DVC version: 3.59.0 (brew)
--------------------------
Platform: Python 3.13.1 on macOS-15.2-arm64-arm-64bit-Mach-O
Subprojects:
	dvc_data = 3.16.7
	dvc_objects = 5.1.0
	dvc_render = 1.0.2
	dvc_task = 0.40.2
	scmrepo = 3.3.9
Supports:
	azure (adlfs = 2024.12.0, knack = 0.12.0, azure-identity = 1.19.0),
	gdrive (pydrive2 = 1.21.3),
	gs (gcsfs = 2024.12.0),
	hdfs (fsspec = 2024.12.0, pyarrow = 18.1.0),
	http (aiohttp = 3.11.11, aiohttp-retry = 2.9.1),
	https (aiohttp = 3.11.11, aiohttp-retry = 2.9.1),
	oss (ossfs = 2023.12.0),
	s3 (s3fs = 2024.12.0, boto3 = 1.35.93),
	ssh (sshfs = 2024.9.0),
	webdav (webdav4 = 0.10.0),
	webdavs (webdav4 = 0.10.0),
	webhdfs (fsspec = 2024.12.0)
Config:
	Global: /Users/moonbuggy/Library/Application Support/dvc
	System: /opt/homebrew/share/dvc

Just made a discovery: the .dvc/tmp/rwlock file is continually jumping between 0B and 131kB, and 131kB * 325,000 ≈ 43GB - so if that file is much larger than usual because it refers to every .dvc file in the repo(?), and it gets wiped and rewritten each time a new .dvc file is processed, that would more or less explain the write volume.
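
Sanity-checking that arithmetic (both numbers are just the ones observed above):

```python
# Back-of-the-envelope check: one full rwlock rewrite per .dvc file processed.
rwlock_size_bytes = 131_000   # observed peak size of .dvc/tmp/rwlock
dvc_file_count = 325_881      # total .dvc files counted earlier

total_gb = rwlock_size_bytes * dvc_file_count / 1e9
print(f"~{total_gb:.0f} GB rewritten")  # ~43 GB, roughly the write volume seen per pass
```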

Not sure what that says for the bigger picture yet, though.

Still going in the same pattern after 36 hours, so I just ctrl-c’d it and nothing unusual seems to have happened - all of the writes seem to have been in the cache directory, and the data had already been pushed to the remote, so worst case that can be recreated. I disabled the hook and committed all the .dvc files to git, so at least now there’s a known snapshot of the data in the dvc remote and of the hashes in git.

I’ve started a separate thread asking about the underlying goal I was trying to achieve with this setup - would be interested if you had any thoughts on that, @shcheklein?

Since I had the repo set up anyway, I’ve done some benchmarking on this just in case the numbers are useful to anyone (12 core / 24 thread workstation with a fast local SSD and a gigabit connection to the cloud, so there shouldn’t be any significant bottlenecks). Obviously I understand that this was a bad idea from the start on my part - the DataChain approach in the other thread looks promising - but I figure it’s still potentially valuable to get an idea of where the most significant performance wins in dvc itself could be made. I also disabled the pre-commit hooks and did everything manually here to avoid ambiguity.


The initial dvc pull of the repo into a clean folder was the slowest operation, taking around 35 hours to complete.

From there, I tried using it as I was originally planning to in day-to-day use: added about 800 new files to the existing ~300,000 and altered a further 1,500 of the existing ones.

  • git status - a few seconds, correctly identifies the new files that aren’t yet tracked by dvc
  • dvc status - around 10 minutes, correctly identifies the dvc-tracked files that have been altered
  • dvc data status - around 14 hours(!), appears to identify the same changes as dvc status
  • dvc add --glob 'data/*/foo.bar' - around 70 minutes; I tried this with a couple of different glob patterns, targeting different numbers of changes, and all took about the same amount of time
  • dvc push - around 10 hours
  • git add 'data/*/*.dvc' - a few seconds
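
If anyone wants to re-run these on their own repo, a simple wall-clock wrapper along these lines should reproduce the shape of the numbers (a sketch, not exactly what I ran):

```python
# Crude wall-clock timing of the commands benchmarked above; absolute numbers will
# obviously depend on hardware, remote latency and repo state.
import subprocess
import time

COMMANDS = [
    ["git", "status"],
    ["dvc", "status"],
    ["dvc", "data", "status"],
    ["dvc", "push"],
]

for cmd in COMMANDS:
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    minutes = (time.perf_counter() - start) / 60
    print(f"{' '.join(cmd)}: {minutes:.1f} min")
```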


Things that seemed interesting here:

  • git is still able to operate pretty quickly and identify changes across the repo as a whole - makes me wonder if delegating some of the status checking to git could be an efficient way to speed things up (rough sketch of what I mean after this list)
  • dvc status is still usable - not super fast, but perfectly plausible to run for a few minutes to check the output of a multi-hour simulation; I actually wasn’t expecting it to pinpoint all the file-level changes, and it’s presumably traversing the repo as a whole to do so, so I think I need a better understanding of how it differs from dvc data status
  • This is all running on a single CPU core, and that seems to be the main bottleneck - obviously parallelising would be a big job, perhaps prohibitively so, but it seems like it’d bring even the multi-hour calls down into the realm of a few tens of minutes
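
As promised above, a rough sketch of the git-delegation idea - it assumes the data lives under data/ (as in my globs), only covers brand-new files, and is completely untested:

```python
# Sketch of delegating the cheap "what's new?" check to git, then handing only those
# paths to dvc add instead of letting dvc scan all ~300k entries. Only covers new
# files: edits to already-tracked data won't appear in git status, because dvc add
# gitignores the data files themselves.
import subprocess


def untracked_data_paths(prefix: str = "data/") -> list[str]:
    """New files under `prefix` that git reports as untracked (not yet dvc-added)."""
    out = subprocess.run(
        ["git", "status", "--porcelain"], capture_output=True, text=True, check=True
    ).stdout
    paths = [line[3:] for line in out.splitlines() if line.startswith("??")]
    return [p for p in paths if p.startswith(prefix) and not p.endswith(".dvc")]


if __name__ == "__main__":
    new_files = untracked_data_paths()
    if new_files:
        # dvc add accepts multiple targets, so only these files get hashed
        subprocess.run(["dvc", "add", *new_files], check=True)
```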

Not necessarily expecting any changes or direct followup, but that’s the info I’ve got for anyone who finds this thread later!