I’m maintaining a dataset for a task where reproducibility is very important, with a large number of small files (hundreds of thousands, likely to grow towards a million). Files are updated over a few experimental sessions across a few weeks before being permanently finalised once the relevant work is done, so it’s important to have a log of what was changed in each update as well as a way of verifying that nothing’s changed after a file is marked as a final version.
My original plan was to use one .dvc file per underlying file, but that didn't go well. Having an on-disk record of every file hash that I'm able to check programmatically, as well as a git-tracked record of when those hashes changed, would have been ideal - I'm wondering whether there's a sensible way to achieve a similar result, maybe with per-directory .dvc files? I guess what I'm really looking for would be a directory-level .dvc file with filename:hash pairs, but I don't think that's an option.
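To make that concrete, the kind of thing I have in mind is roughly this sketch - walk the dataset directory, hash every file, and write a plain-text filename: hash listing that can be committed to git (the paths, the manifest name and the choice of MD5 are just placeholders):

```python
# Minimal sketch of the manifest idea (illustrative only): walk the dataset
# directory, hash every file, and write a git-trackable "filename: hash" listing.
# The dataset path and manifest filename are placeholders.
import hashlib
from pathlib import Path

DATASET_DIR = Path("data")          # placeholder dataset root
MANIFEST = Path("data.manifest")    # plain-text file to commit to git

def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through MD5 so large files don't need to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest() -> None:
    lines = []
    for path in sorted(DATASET_DIR.rglob("*")):
        if path.is_file():
            lines.append(f"{path.relative_to(DATASET_DIR)}: {md5sum(path)}")
    MANIFEST.write_text("\n".join(lines) + "\n")

if __name__ == "__main__":
    write_manifest()
```

Committing a manifest like that after each session would give both the change log (via git diff) and a way to re-verify finalised files later.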
It looks like dvc data status --granular is a step in that direction, but I'm concerned about the potential for unknown return values, and I think it'd require an entire commit to be checked out in order to be used, rather than having every file-level change visible in git?
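For the return-value concern, the check I'd want to script is something like this sketch - it assumes dvc data status accepts --json alongside --granular and returns a mapping of status categories to file lists, and the expected category names below are a guess I'd need to verify against the installed DVC version:

```python
# Rough sketch of the check I'd want: run `dvc data status --granular --json`
# and fail loudly if any status category outside an expected allowlist appears.
# The allowlist below is an assumption and would need checking against the
# actual JSON emitted by the installed DVC version.
import json
import subprocess

EXPECTED_CATEGORIES = {"committed", "uncommitted", "not_in_cache", "unchanged"}  # assumed

def granular_status() -> dict:
    result = subprocess.run(
        ["dvc", "data", "status", "--granular", "--json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)

def check_status() -> None:
    status = granular_status()
    unknown = set(status) - EXPECTED_CATEGORIES
    if unknown:
        raise RuntimeError(f"Unrecognised status categories: {sorted(unknown)}")
    print(json.dumps(status, indent=2, default=str))

if __name__ == "__main__":
    check_status()
```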
@moonbuggy could you give a bit more detail? Are you using some cloud storage for the original files?
I guess what I’m really looking for would be a directory-level .dvc file with filename:hash pairs, but I don’t think that’s an option.
I feel that DataChain can serve you better - see GitHub - shcheklein/example-datachain-dvc (an example of how to use DataChain and DVC to version data, make a project reproducible, and track experiments and models). PTAL and let me know what you think. It essentially creates a table with filename:hash pairs which you will also be able to diff between commits. Happy to jump on a call and discuss the details / brainstorm how we can achieve what you need.
Thanks for that, I hadn’t come across DataChain previously and it looks like it could potentially be a good option here.
The data’s being created locally - the intention was to use dvc as the mechanism to push it to cloud storage in a versioned, provider-agnostic way. Each local simulation run will create a few hundred new output files as well as append to some existing ones, and then I’m looking for a way to cleanly snapshot the full dataset at that point and be able to pinpoint what was changed, what was added, and what was left as-is after any given simulation.
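To put it another way, the post-run report I'm picturing is roughly this sketch, comparing the freshly generated manifest (as in the earlier snippet) against the previously committed copy - file names are placeholders:

```python
# Sketch of the post-run comparison I'm after: diff the freshly generated
# manifest against the previously committed one and report what was added,
# modified, deleted or left untouched. File names are placeholders; the
# manifest format matches the earlier "relative/path: md5" sketch.
from pathlib import Path

def load_manifest(path: Path) -> dict[str, str]:
    entries = {}
    for line in path.read_text().splitlines():
        name, _, digest = line.rpartition(": ")
        entries[name] = digest
    return entries

def diff_manifests(old_path: Path, new_path: Path) -> dict[str, list[str]]:
    old, new = load_manifest(old_path), load_manifest(new_path)
    return {
        "added": sorted(new.keys() - old.keys()),
        "deleted": sorted(old.keys() - new.keys()),
        "modified": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
        "unchanged": sorted(k for k in old.keys() & new.keys() if old[k] == new[k]),
    }

if __name__ == "__main__":
    report = diff_manifests(Path("data.manifest.previous"), Path("data.manifest"))
    for category, files in report.items():
        print(f"{category}: {len(files)} files")
```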
If I’m understanding the repo you linked and the DataChain docs correctly, it looks like a key difference here is relying on the cloud provider to handle the per-file versioning rather than trying to do that within the local versioning system? And then from there DataChain keeps track of the list of files and provider-side hashes/version IDs in a parquet file, which can be stored using dvc for a full dataset snapshot at any given point in time?
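Just to check my mental model of that - this isn't DataChain's actual implementation, only a sketch of the concept using boto3 and pandas against an assumed versioned S3 bucket with placeholder names - the provider-side state I'd be snapshotting looks something like:

```python
# Not DataChain's actual implementation - just a sketch of my mental model:
# snapshot the provider-side state (key, ETag, VersionId) of a versioned S3
# bucket into a parquet file that could then be tracked with dvc / git.
# Bucket name and prefix are placeholders.
import boto3
import pandas as pd

BUCKET = "my-dataset-bucket"   # placeholder
PREFIX = "simulations/"        # placeholder

def snapshot_bucket_versions() -> pd.DataFrame:
    s3 = boto3.client("s3")
    rows = []
    paginator = s3.get_paginator("list_object_versions")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for version in page.get("Versions", []):
            if version["IsLatest"]:
                rows.append({
                    "key": version["Key"],
                    "etag": version["ETag"].strip('"'),
                    "version_id": version["VersionId"],
                    "last_modified": version["LastModified"],
                })
    return pd.DataFrame(rows)

if __name__ == "__main__":
    snapshot_bucket_versions().to_parquet("dataset_snapshot.parquet", index=False)
```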
The only potential downside I see is that it tightly couples the dataset to the specific cloud storage bucket and provider. Not a blocker per se, but I’ve always preferred to keep things portable in case a particular provider becomes untenable in future - one of the reasons I like dvc’s ability to just pull from one remote and push to another if required. It also means Cloudflare R2 is out because they don’t support versioning (again, points for letting dvc handle that, because it makes R2’s no-egress-fee storage usable for versioned data).
If all my assumptions so far are correct it does look like it could work well, but I’ll definitely want to spend a bit of time understanding the implications of relying on the cloud provider’s ETags and version IDs, and what the worst case would be for shifting the version history across to an alternative provider in future if needed.