Add dataset metadata in .dvc file?

fra-csl · September 16, 2020, 10:12am

Dear Support,

I am using DVC to track different versions of a dataset consisting of annotations on a NER dataset. In particular, I would like to keep track of some dataset metadata which may change from one version to another, and I would also like to quickly compare different versions of the dataset through the evolution of these metadata.

My current solution is to track the information about the dataset in a README.md, but I would like to make the link with the dataset file clearer than that.

How can I do that? I thought about adding this metadata to the automatically generated dataset.dvc file, but I am not sure this is a good idea: would a future dvc add overwrite or delete it, for instance?

Thank you in advance for your help!

Best,

—Francesco

Paffciu · September 16, 2020, 12:30pm

@fra-csl Hi!
So, as I understand it, this metadata is something that you create manually?
It is possible to use comments and the meta key to store user-defined data inside the dvc.yaml file.
Take a look at this doc page for a usage example:
https://dvc.org/doc/user-guide/dvc-files-and-directories#dvcyaml-file

If you write this metadata in some automated way, maybe you could consider creating {output-name}-metadata alongside the output and treat it as normal DVC output?
Best,
Paweł

fra-csl · September 16, 2020, 3:04pm

Dear Paweł,

Thank you for your prompt reply.

If you write this metadata in some automated way, maybe you could consider creating {output-name}-metadata alongside the output and treat it as normal DVC output?

Unfortunately, this metadata is not an output of the ML pipeline. On the contrary, it represents various kinds of information we collect about the dataset itself (e.g. who annotated it, when the annotation took place, using which tool, …).

It is possible to use comments and the meta key to store user-defined data inside the dvc.yaml file.

I thought that dvc.yaml was used to represent the DAG of an ML pipeline, but here I am talking about data that does not necessarily interact with any ML pipeline… Does it make sense to put this meta key in the my_dataset.dvc file instead? Or should I do something different?

Best,

Francesco

Paffciu · September 16, 2020, 3:19pm

@fra-csl
If it is something that is (for example) too big for a *.dvc/*.yaml file, and you are afraid it will make the pipeline file hard to read, I would consider
keeping your md files approach, but using meta to store the path to your md files,
so that:

you have a data file
you create data-metadata
in data.dvc you add

meta:
  metadata-path: data-metadata

That way you keep using your current approach, but also add a link between the data and its metadata.

What do you think about this approach?
Best
Paweł

skshetry · September 16, 2020, 3:29pm

I’ll second what @Paffciu said about the meta keyword: it is meant for keeping the kind of metadata you said you require.

You could add anything inside meta, and DVC will preserve it for the most part.
E.g.:

outs:
- md5: md5
  path: data
meta:
  description: |
     Keep long form of description
     that spans multiple lines
  annotated_by: "@someone"

shcheklein · September 16, 2020, 7:53pm

What are the current situations when DVC overwrites the .dvc file (thus removing the meta field)?

and I would also like to quickly compare different versions of the dataset through the evolution of these metadata.

@fra-csl could you give an example of this?

Another alternative (a bit counterintuitive) is to use our params.yaml file. DVC is not opinionated about what you put there, and it already has built-in capabilities to show you a diff, a table with values from this file, etc. I would explore this option.
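
To illustrate the kind of version-to-version comparison being asked about, here is a minimal Python sketch. It assumes the metadata of each dataset version has already been loaded into a flat dictionary (for instance with a YAML library); the `diff_meta` helper and the sample values are hypothetical and not part of DVC.

```python
from typing import Any, Dict


def diff_meta(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Compare two flat metadata mappings and report what changed."""
    # Keys present only in the newer version.
    added = {k: new[k] for k in new if k not in old}
    # Keys present only in the older version.
    removed = {k: old[k] for k in old if k not in new}
    # Keys present in both versions whose values differ.
    changed = {
        k: {"old": old[k], "new": new[k]}
        for k in old
        if k in new and old[k] != new[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical metadata for two versions of the same dataset.
meta_v1 = {"annotated_by": "@someone", "tool": "tool-a"}
meta_v2 = {"annotated_by": "@someone-else", "tool": "tool-a", "reviewed": True}

result = diff_meta(meta_v1, meta_v2)
print(result)
```

The sketch only handles flat key/value metadata; nested values are compared as a whole rather than field by field.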

skshetry · September 17, 2020, 2:28pm

What are the current situations when DVC overwrites the .dvc file (thus removing the meta field)?

Running “dvc add/import/import-url” on that same dataset will overwrite it, i.e. whenever those commands create a “.dvc” file of the same name.
Look at the end of this section for more information.
cc @fra-csl
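
Given the overwrite behaviour described above, one possible workaround is to copy the meta block from the old .dvc file into the regenerated one. The following is only a rough sketch: it works on the file contents as plain text, assumes meta is a top-level key whose body is indented, and uses hypothetical helper names (`extract_top_level_block`, `restore_meta`) that are not part of DVC. A real YAML parser would be more robust.

```python
from typing import List


def extract_top_level_block(text: str, key: str) -> str:
    """Return the top-level `key:` line plus its indented body, or "" if absent."""
    block: List[str] = []
    for line in text.splitlines():
        if block:
            # The block continues while lines are indented or blank.
            if line.startswith((" ", "\t")) or not line.strip():
                block.append(line)
            else:
                break
        elif line.startswith(key + ":"):
            block.append(line)
    # Drop trailing blank lines picked up at the end of the block.
    while block and not block[-1].strip():
        block.pop()
    return "\n".join(block)


def restore_meta(old_text: str, new_text: str) -> str:
    """Append the old `meta` block to new_text unless new_text already has one."""
    meta_block = extract_top_level_block(old_text, "meta")
    if not meta_block or extract_top_level_block(new_text, "meta"):
        return new_text
    return new_text.rstrip("\n") + "\n" + meta_block + "\n"


# Hypothetical contents of a .dvc file before and after it was regenerated.
old_dvc = """outs:
- md5: md5
  path: data
meta:
  annotated_by: "@someone"
  tool: tool-a
"""
new_dvc = """outs:
- md5: new-md5
  path: data
"""

restored = restore_meta(old_dvc, new_dvc)
print(restored)
```

In practice the old file contents would have to be read before re-running the command that regenerates the .dvc file, and the restored text written back afterwards.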

dmitry · September 17, 2020, 6:02pm

@fra-csl, this is a very good use case.

Meta is a good workaround that can potentially help you. But it might be worth investigating this use case more deeply, and I’d appreciate it if you could help us. It feels like you need additional functionality. Also, I have a feeling that some other scenarios can benefit from this feature.

We might have more requirements, like not overwriting the labels, or inheriting some types of metadata in ML models or processed datasets that were based on these signals.

Some data platforms support a similar concept as data labels and provide data lineage for inheriting the labels properly. For example, if an ML model was based on a dataset with labels user-personal-information or confidential then these labels have to be assigned to the model.

We should probably think about introducing data labeling in DVC with a set of rules and commands to support it properly.

@fra-csl what would be your requirements for this kind of feature? As far as I can see:

DVC should not change the labels when the dataset changes (dvc add)
Label inheritance is not needed for labels like who annotated it, when the annotation took place, using which tool
For some types of labels the inheritance is required.
The labels can be in *.dvc files as well as dvc.yaml
