DVC and large datasets

It is hard to get a good understanding of the tool from the example-driven documentation alone, which is why I am asking here.

Use case: I have a large dataset consisting of hundreds of GB of data, and I want to trace how it was created.

Question: Does DVC support tracking just the version of the data pipeline and its parameters, plus perhaps a summary/fingerprint-based description of the data source? I do not want DVC to look at the data itself and compute hashes, because that is very compute-intensive and would require either downloading the data, which is not feasible, or integrating DVC into a large distributed map-reduce pipeline (is that supported at all?), which is very invasive.

Is it possible for DVC to not touch any of the data stored remotely, but still store all of the information needed to understand the origin of the data, the origin of the software that was used to create it, and any configuration changes and local modifications? Is there a working example for this scenario?

Thank you very much for any answers or discussion!


@rlgebov really good question.

In this case I would suggest maintaining a file (Parquet, CSV) or a DB (SQLite) that holds metadata (labels, file names, etc.) and versioning that file instead of the data itself. You can also enable data versioning at the cloud/bucket level and capture the version IDs the cloud provides, so you get an actual snapshot of the source data. All of this can be done as a single step in a DVC pipeline; see the sketch below.
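A minimal sketch of what such a snapshot step could look like, assuming an S3 bucket with object versioning enabled and boto3 configured with credentials; the bucket name, prefix, and output path are placeholders:

```python
"""Snapshot an S3 prefix into a small metadata file that DVC versions,
instead of hashing the large data itself."""
import csv
from pathlib import Path

import boto3

BUCKET = "my-large-dataset"   # hypothetical bucket name
PREFIX = "raw/"               # hypothetical prefix within the bucket
OUT = "metadata/listing.csv"  # the only file DVC will hash and version

Path(OUT).parent.mkdir(parents=True, exist_ok=True)

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_object_versions")

with open(OUT, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["key", "version_id", "size", "last_modified"])
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Versions", []):
            # Record only the current version of each object as the snapshot
            if obj["IsLatest"]:
                writer.writerow(
                    [obj["Key"], obj["VersionId"], obj["Size"],
                     obj["LastModified"].isoformat()]
                )
```

If you wire this up as a pipeline stage (e.g. `dvc stage add -n snapshot -d snapshot.py -o metadata/listing.csv python snapshot.py`), DVC only hashes the small CSV, never the hundreds of GB behind it, while the recorded version IDs still pin down the exact state of the source data.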

Re examples, the closest one I have is a bit outdated, but it highlights the idea behind the “metadata” file approach: GitHub - shcheklein/example-datachain-dvc (an example of how to use DataChain and DVC to version data, make a project reproducible, and track experiments and models). DataChain there serves as the tool to produce those snapshots; DVC is the way to version data, run pipelines and experiments, and store models.

Let me know what your thoughts are on this approach.