DVC and large datasets

It is hard to get a good understanding of the tool from the example-driven documentation alone, which is why I am asking here.

Use case: I have a large dataset consisting of hundreds of GB of data, and I want to trace how it was created.

Question: Does DVC support tracking just the version of the data pipeline and its parameters, plus perhaps a summary/fingerprint-based description of the data source? I do not want DVC to look at the data itself and compute hashes, because that is very compute-intensive and would require either downloading the data, which is not feasible, or integrating DVC into a large distributed map-reduce pipeline (is that supported at all?), which is very invasive.

Is it possible for DVC to not touch any of the data stored remotely, but still store all of the information needed to understand the origin of the data, the origin of the software that was used to create it, and any configuration changes and local modifications? Is there a working example for this scenario?

Thank you very much for any answers or discussion!


@rlgebov really good question.

In this case I would suggest maintaining a file (Parquet, CSV) or a DB (SQLite) that holds metadata (labels, file names, etc.) and versioning that file instead of the data itself. You can also enable data versioning at the cloud/bucket level and capture the version IDs the cloud provides, so you get an actual snapshot of the source data. All of this can be done as a single step in a DVC pipeline; see the sketch below.
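A minimal sketch of what such a snapshot step could look like, assuming an S3 bucket with object versioning enabled and boto3 configured with credentials; the bucket name, prefix, and output path are placeholders:

```python
"""Snapshot an S3 prefix into a small metadata file that DVC versions,
instead of hashing the large data itself."""
import csv
from pathlib import Path

import boto3

BUCKET = "my-large-dataset"   # hypothetical bucket name
PREFIX = "raw/"               # hypothetical prefix within the bucket
OUT = "metadata/listing.csv"  # the only file DVC will hash and version

Path(OUT).parent.mkdir(parents=True, exist_ok=True)

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_object_versions")

with open(OUT, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["key", "version_id", "size", "last_modified"])
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Versions", []):
            # Record only the current version of each object as the snapshot
            if obj["IsLatest"]:
                writer.writerow(
                    [obj["Key"], obj["VersionId"], obj["Size"],
                     obj["LastModified"].isoformat()]
                )
```

If you wire this up as a pipeline stage (e.g. `dvc stage add -n snapshot -d snapshot.py -o metadata/listing.csv python snapshot.py`), DVC only hashes the small CSV, never the hundreds of GB behind it, while the recorded version IDs still pin down the exact state of the source data.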

Re examples, the closest one I have is a bit outdated, but it highlights the idea behind the “metadata” file approach: GitHub - shcheklein/example-datachain-dvc (an example of how to use DataChain and DVC to version data, make a project reproducible, and track experiments and models). DataChain there serves as the tool to produce those snapshots; DVC is the way to version data, run pipelines and experiments, and store models.

Let me know what your thoughts are on this approach.