Best way to pinpoint per-file changes in 300k+ files?

I’m maintaining a dataset for a task where reproducibility is very important, with a large number of small files (hundreds of thousands, likely to grow towards a million). Files are updated over a few experimental sessions across a few weeks before being permanently finalised once the relevant work is done, so it’s important to have a log of what changed in each update, as well as a way of verifying that nothing has changed after a file is marked as a final version.

My original plan was to use one .dvc file per underlying file, but that didn’t go well. Having an on-disk record of every file hash that I can check programmatically, as well as a git-tracked record of when those hashes changed, would have been ideal - I’m wondering whether there’s a sensible way to achieve a similar result, maybe with per-directory .dvc files? I guess what I’m really looking for would be a directory-level .dvc file with filename:hash pairs, but I don’t think that’s an option.
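
To illustrate, something like the sketch below is the record I’d like to end up with - one git-tracked manifest per directory (purely hypothetical, all paths are placeholders):

```python
# Hypothetical manifest generator: one git-tracked JSON file per data
# directory, mapping relative path -> md5, so any change after finalisation
# shows up as a plain git diff. All paths here are placeholders.
import hashlib
import json
from pathlib import Path

def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so hashing stays memory-bounded."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(data_dir: Path, manifest_path: Path) -> None:
    # Sorted, stable output keeps the git diffs minimal and reviewable.
    entries = {
        str(p.relative_to(data_dir)): file_md5(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }
    manifest_path.write_text(json.dumps(entries, indent=1, sort_keys=True) + "\n")

write_manifest(Path("data/session_01"), Path("manifests/session_01.json"))
```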

It looks like dvc data status --granular is a step in that direction, but I’m concerned about the potential for unknown return values, and I think it would require an entire commit to be checked out before it could be used, rather than making every file-level change visible in git?
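
i.e. I’d want to be able to consume it along these lines, assuming a recent DVC where dvc data status supports --json (the set of status keys is my guess from the docs, which is exactly the part I’m unsure about):

```python
# Sketch of how I'd hope to consume it: parse the machine-readable output and
# fail loudly on any status category I haven't explicitly accounted for.
import json
import subprocess

result = subprocess.run(
    ["dvc", "data", "status", "--granular", "--json"],
    capture_output=True, text=True, check=True,
)
status = json.loads(result.stdout)

# Guessed from the docs - the uncertainty about this set is my concern.
known = {"committed", "uncommitted", "not_in_cache", "unchanged", "untracked", "git"}
unexpected = set(status) - known
if unexpected:
    raise RuntimeError(f"unrecognised status keys: {unexpected}")
```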

@moonbuggy

Could you give a bit more detail? Are you using some cloud to store the original files?

I guess what I’m really looking for would be a directory-level .dvc file with filename:hash pairs, but I don’t think that’s an option.

I feel that DataChain can serve you better - GitHub - shcheklein/example-datachain-dvc: An example how to use DataChain and DVC to version data, make project reproducible, track experiments and models - please take a look and let me know what you think. It can essentially create a table with filename:hash pairs, which you will also be able to diff between commits. Happy to jump on a call to discuss the details and brainstorm how we can achieve what you need.
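
Roughly like this, as a minimal sketch based on that example repo (bucket URI and dataset name are placeholders):

```python
# Minimal sketch: index a versioned bucket and save a named snapshot of the
# filename/etag/version table, which can then be diffed between saves.
from datachain import DataChain

chain = (
    DataChain.from_storage("s3://my-bucket/dataset/")  # indexes every file
    .select("file.path", "file.etag", "file.version", "file.size")
)
chain.save("dataset-files")
```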

Thanks for that - I hadn’t come across DataChain previously, and it looks like it could be a good option here.

The data’s being created locally - the intention was to use dvc as the mechanism to push it to cloud storage in a versioned, provider-agnostic way. Each local simulation run will create a few hundred new output files as well as appending to some existing ones, and I’m looking for a way to cleanly snapshot the full dataset at that point and pinpoint what was changed, what was added, and what was left as-is after any given simulation.

If I’m understanding the repo you linked and the DataChain docs correctly, a key difference here is relying on the cloud provider to handle the per-file versioning rather than trying to do that within the local versioning system? And from there, DataChain keeps track of the list of files and the provider-side hashes/version IDs in a parquet file, which can be stored using dvc for a full dataset snapshot at any given point in time?
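
So the snapshot step would be something like the sketch below (file names are placeholders, and I’m assuming DataChain’s to_parquet export from the docs):

```python
# Rough sketch of the snapshot step as I understand it: dump the index to a
# parquet file, then let dvc version that one small file instead of hundreds
# of thousands of outputs. Bucket URI and file names are placeholders.
import subprocess

from datachain import DataChain

DataChain.from_storage("s3://my-bucket/dataset/").to_parquet("manifest.parquet")
subprocess.run(["dvc", "add", "manifest.parquet"], check=True)
```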

The only potential downside I can see is that it tightly couples the dataset to a specific cloud storage bucket and provider. Not a blocker per se, but I’ve always preferred to keep things portable in case a particular provider becomes untenable in future - one of the reasons I like dvc’s ability to just pull from one remote and push to another if required. It also means Cloudflare R2 is out, because they don’t support versioning (again, points for letting dvc handle that, since it makes R2’s no-egress-fee storage usable for versioned data).

If all my assumptions so far are correct, it does look like it could work well, but I’ll definitely want to spend a bit of time understanding the implications of using the cloud provider’s etags and version IDs, and what the worst case would be for shifting the version history across to an alternative provider in future if needed.

Correct. Moreover, you can add a mapper to calculate any custom hash on top of it if needed.

The only potential downside I can see is that it tightly couples the dataset to a specific cloud storage bucket and provider.

Right. If you include some custom hash (based on content, size, or something similar) you can potentially rely on that instead of cloud-specific attributes. The only thing you would need to migrate is the file.source attribute.
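
A minimal sketch of what I mean (assuming File objects expose read(), which is fine here since your files are small):

```python
# Minimal sketch: attach a provider-independent md5 signal while indexing.
# Reads each object fully into memory, which is fine for small files.
import hashlib

from datachain import DataChain

chain = DataChain.from_storage("s3://my-bucket/dataset/").map(
    md5=lambda file: hashlib.md5(file.read()).hexdigest(),
    output=str,
)
chain.save("dataset-files-md5")
```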

It won’t solve things like the worst case for shifting the version history across to an alternative provider, though - that’s a good point.

Let me know if you have more thoughts - it’s an interesting discussion.

Much appreciated! I’ve done a bit more reading around, and it looks like rclone’s handling of versions and etags between clouds is pretty robust, so it might actually be safe enough to assume that S3-compatible stores are fungible with the same underlying DataChain index, just by doing a full rclone sync with the relevant metadata flags. Backblaze has versioning support as well as a fairly sane egress policy, so it seems like a decent stand-in for Cloudflare unless/until they add version handling to R2.
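
For the record, the migration I have in mind is roughly the following (remote names are placeholders, and this only mirrors the current view of the bucket, not the version history - which is still the open question):

```python
# Hypothetical migration sketch: mirror one S3-compatible remote to another
# with rclone, comparing by checksum and carrying object metadata across.
# Remote names are placeholders; version history is NOT copied by this.
import subprocess

subprocess.run(
    [
        "rclone", "sync",
        "s3-old:my-bucket/dataset",   # source remote (placeholder)
        "s3-new:my-bucket/dataset",   # destination remote (placeholder)
        "--checksum",                 # compare content, not modtimes
        "--metadata",                 # preserve object metadata
    ],
    check=True,
)
```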

I didn’t see custom mappers on my first look at the docs, so that’s also really good to know; having something straightforward and locally calculated as an extra identifier, as a fallback if it’s needed, definitely sounds reassuring. It ensures no loss of indexing if disaster recovery from a local copy is ever needed - even if everything else fails, those local hashes can be dumped from the parquet file and matched up against the files on disk.
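
Worst case, recovery would look something like the sketch below (the column names are my guess at how the mapper output would land in the parquet file):

```python
# Hypothetical disaster-recovery check: re-hash a local copy of the data and
# compare against the path/md5 pairs dumped from the parquet manifest.
# Column names ("file.path", "md5") are assumptions, not DataChain's API.
import hashlib
from pathlib import Path

import pandas as pd

manifest = pd.read_parquet("manifest.parquet")
root = Path("recovered_data")

for _, row in manifest.iterrows():
    path = root / row["file.path"]
    if not path.is_file():
        print(f"MISSING:  {path}")
    elif hashlib.md5(path.read_bytes()).hexdigest() != row["md5"]:
        print(f"MISMATCH: {path}")
```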

I know it might seem a bit silly to worry about this when it’s a relatively trivial amount of data so far, but in my experience decisions around workflow and tooling tend to be very sticky once they’re made! The last project I worked on ballooned pretty quickly from a relatively modest start to >100TB managed across 20+ datasets and teams, and having portability and multi-cloud support built in from the start absolutely saved our collective asses on a couple of occasions, so avoiding proprietary lock-in is always on my mind - if I can get it right on this dataset, it’ll make the next however many with similar structures a whole lot easier.
