Tracking Data Provenance with DVC

diehl · May 11, 2022, 12:15am

I recently discovered DVC and would like to use it to replace my current shell-script-based approach to downloading source datasets and building derived datasets. My current process keeps a clear record of data provenance, because the scripted pipelines begin by downloading the source datasets from the web.

The question I have is the following: does DVC provide functionality that allows me to capture data provenance somehow? Can I record the URL from which the data was originally sourced and bind that to metadata associated with the data file(s)? Or will I need to maintain scripts that allow me to easily reacquire the data from the web? In the documentation, it seems like the story begins with the source data already in hand.

I would love to have functionality that lets me easily reacquire the source data from the web if needed and verify, via hash comparison, that the downloaded dataset is identical to the one used when the pipeline was originally developed.
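For illustration, the check described above could be sketched in plain Python, independent of DVC. The function names `file_digest` and `reacquire_and_verify` are hypothetical, SHA-256 is an assumed choice of hash, and the expected digest is assumed to have been recorded when the data was first downloaded.

```python
import hashlib
import urllib.request


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reacquire_and_verify(url, dest, expected_digest, algorithm="sha256"):
    """Download `url` to `dest` and report whether it matches the recorded digest."""
    urllib.request.urlretrieve(url, dest)  # re-download the source dataset
    return file_digest(dest, algorithm) == expected_digest.lower()
```

A call such as `reacquire_and_verify(url, "raw.csv", recorded_digest)` would return `True` only if the freshly downloaded file is byte-for-byte identical to the one whose digest was recorded.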

So far, I’m not seeing how I would accomplish this with existing DVC commands. Any pointers would be greatly appreciated!

daavoo · May 18, 2022, 10:50am

Hi @diehl, sorry for the late response.

If I understand your use case correctly, you might be interested in using dvc import (https://dvc.org/doc/command-reference/import) and/or dvc import-url (https://dvc.org/doc/command-reference/import-url).
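A minimal sketch of how the suggested command might be driven from Python, assuming `dvc` is installed and the working directory is an initialized DVC repository. The helper names, and any URL or path passed to them, are hypothetical. `dvc import-url` downloads the file and writes a `.dvc` file that records the source URL alongside the tracked output; `dvc update` re-checks that source for a given `.dvc` file.

```python
import subprocess


def import_url_command(url, out):
    """Build the command that downloads `url` to `out` and tracks its source."""
    return ["dvc", "import-url", url, out]


def update_command(dvc_file):
    """Build the command that refreshes an import from its recorded source."""
    return ["dvc", "update", dvc_file]


def run(command):
    """Run a command, raising CalledProcessError if it exits with a non-zero status."""
    subprocess.run(command, check=True)
```

For example, `run(import_url_command("https://example.com/data.csv", "data/raw.csv"))` would be expected to create `data/raw.csv` and a matching `data/raw.csv.dvc` file, which can then be committed to Git as the provenance record.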

diehl · May 21, 2022, 6:22pm

Hi @daavoo, no worries!

That sounds on point. Thank you!
