It is hard to get a good understanding of the tool from the example-driven documentation alone, which is why I am asking here.
Use Case: I have a large dataset (hundreds of GB) and I want to trace how it was created.
Question: Does DVC support tracking just the version of the data pipeline and its parameters, plus perhaps a summary/fingerprint-based description of the data sources? I do not want DVC to look at the data itself and compute hashes, because that is very compute-intensive: it would require either downloading the data, which is not feasible, or integrating DVC into the large distributed map-reduce pipeline (is that supported at all?), which would be very invasive.
Can DVC avoid touching any of the remotely stored data while still recording all of the information needed to understand the origin of the data, the origin of the software that created it, and any configuration changes and local modifications? Is there a working example of this scenario?
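To make the question concrete, here is a sketch of the kind of setup I am imagining. This is purely hypothetical: the stage name, script path, `spark-submit` command, and parameter keys are all made up, and I do not know whether DVC actually supports tracking only a small manifest in place of the data itself.

```yaml
# Hypothetical dvc.yaml sketch of what I am hoping for: DVC versions the
# pipeline definition, its parameters, and a small manifest file, but never
# hashes the multi-hundred-GB dataset itself.
stages:
  build_dataset:
    cmd: spark-submit build_dataset.py   # the distributed job; name assumed
    deps:
      - build_dataset.py                 # the code that produced the data
    params:
      - build_dataset.partitions         # parameter keys assumed
      - build_dataset.source_snapshot
    outs:
      # Instead of the data itself, track a tiny summary written by the job
      # (row counts, partition list, upstream snapshot id).
      - manifests/dataset_summary.json
```

If something along these lines is possible, the manifest would serve as the fingerprint-based description mentioned above, while the actual data stays untouched in remote storage.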
Thank you very much for any answers or discussion!