Tracking Data Provenance with DVC

I recently discovered DVC and am looking to use it to replace my current shell-script-based approach for downloading source datasets and building derived datasets. My current process gives me a clear record of data provenance, because the scripted pipelines begin by downloading the source datasets from the web.

My question is this: does DVC provide a way to capture data provenance? Can I record the URL from which the data was originally sourced and bind it to metadata associated with the data file(s)? Or will I need to maintain scripts that let me easily reacquire the data from the web? In the documentation, the story seems to begin with the source data already in hand.

I would love to be able to easily reacquire the source data from the web if needed and verify, via hash comparison, that the dataset is identical to the one used in the original pipeline development.

As of yet, I’m not quite seeing how I would accomplish this with existing DVC commands. Any pointers would be greatly appreciated!

Hi @diehl, sorry for the late response.

If I understand your use case correctly, it looks like you might be interested in using dvc import (https://dvc.org/doc/command-reference/import) and/or dvc import-url (https://dvc.org/doc/command-reference/import-url).
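
To illustrate the hash-comparison part of the question, here is a minimal sketch in plain Python, independent of DVC. It assumes you have kept the MD5 digest of the original download somewhere (for example, copied from the metadata file that the import command writes). The function names md5_of_file and matches_recorded_hash are hypothetical helpers for this example, not part of DVC's API.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks so large datasets need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_hash(path, recorded_md5):
    """Check a re-downloaded file against a previously recorded MD5 digest."""
    return md5_of_file(path) == recorded_md5.strip().lower()
```

After re-downloading a source file, calling matches_recorded_hash on it with the recorded digest tells you whether the bytes are the same as the ones the pipeline was originally developed against.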

Hi @daavoo, no worries!

That sounds on point. Thank you!