Add dataset metadata in .dvc file?

Dear Support,

I am using DVC to track different versions of a dataset consisting of annotations on a NER dataset. In particular, I would like to keep track of some dataset metadata which may change from one version to another, and I would also like to quickly compare different versions of the dataset through the evolution of these metadata.

My current solution is to track the information about the dataset in a README.md, but I would like the link to the dataset file to be more explicit than that.

How can I do that? I thought about adding this metadata to the automatically generated dataset.dvc file, but I am not sure this is a good idea: for instance, would a future dvc add overwrite or delete it?

Thank you in advance for your help!

Best,

—Francesco


@fra-csl Hi!
So, as I understand, this metadata is something that you create manually?
It is possible to use comments and the meta key to store user-defined data inside a dvc.yaml file.
Take a look at this doc page:
https://dvc.org/doc/user-guide/dvc-files-and-directories#dvcyaml-file
for a usage example.

If you write this metadata in some automated way, maybe you could consider creating {output-name}-metadata alongside the output and treating it as a normal DVC output?
Best,
Paweł

Dear Paweł,

Thank you for your prompt reply.

If you write this metadata in some automated way, maybe you could consider creating {output-name}-metadata alongside the output and treating it as a normal DVC output?

Unfortunately, this metadata is not an output of the ML pipeline. On the contrary, it represents various kinds of information we collect about the dataset itself (e.g. who annotated it, when the annotation took place, which tool was used, …).

It is possible to use comments and the meta key to store user-defined data inside a dvc.yaml file.

I thought that dvc.yaml was used to represent the DAG of an ML pipeline, but here I am talking about data that does not necessarily interact with any ML pipeline… Does it make sense to put this meta key in the my_dataset.dvc file instead? Or should I do something different?

Best,

Francesco

@fra-csl
If it's something that is (for example) too big for a *.dvc/*.yaml file, and you are afraid it will make the pipeline file hard to read, I would consider keeping your md files approach but using meta to store the path to your md files,
so that:

  • you have data file
  • you create data-metadata
  • in data.dvc add
meta:
  metadata-path: data-metadata

That way you keep using your current approach, but also add a link between the data and its metadata.

What do you think about this approach?
Best
Paweł

I’ll second what @Paffciu said about the meta keyword: it is meant for keeping the kind of metadata you said you require.

You can add anything inside meta, and DVC will preserve it for the most part.
E.g.:

outs:
- md5: md5
  path: data
meta:
  description: |
     Keep long form of description
     that spans multiple lines
  annotated_by: "@someone"
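
For anyone who prefers to script this step, here is a minimal Python sketch that appends a top-level meta block to the text of a .dvc file. It uses only the standard library and assumes flat string values and a file that has no meta key yet. The helper names render_meta and append_meta, as well as the "tool" value, are hypothetical and not part of DVC.

```python
import json


def render_meta(meta):
    """Render a flat dict of strings as a top-level YAML `meta:` block."""
    lines = ["meta:"]
    for key, value in meta.items():
        # json.dumps yields a double-quoted string, which YAML also accepts.
        lines.append(f"  {key}: {json.dumps(value)}")
    return "\n".join(lines) + "\n"


def append_meta(dvc_text, meta):
    """Return dvc_text with a meta block appended (assumes none exists yet)."""
    if any(line.startswith("meta:") for line in dvc_text.splitlines()):
        raise ValueError("file already has a meta block")
    if not dvc_text.endswith("\n"):
        dvc_text += "\n"
    return dvc_text + render_meta(meta)


# Placeholder .dvc content, mirroring the example above.
dvc_text = "outs:\n- md5: md5\n  path: data\n"
updated = append_meta(
    dvc_text, {"annotated_by": "@someone", "tool": "hypothetical-tool"}
)
print(updated)
```

Working on the raw text keeps the rest of the file byte-for-byte unchanged; a real YAML library would be the better choice for nested or multi-line values.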

What are the current situations when DVC overwrites the .dvc file (thus removing the meta field)?

and I would also like to quickly compare different versions of the dataset through the evolution of these metadata.

@fra-csl could you give an example of this?

Another option (a bit counterintuitive :) ) is to use our params.yaml file. DVC is not opinionated about what you put there, and it already has built-in capabilities to show you a diff, a table with the values from this file, etc. I would explore this option.
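
To illustrate the kind of comparison being asked about, here is a small Python sketch, independent of DVC, that diffs the metadata of two dataset versions held as flat dictionaries. The keys and values are made up for the example, and diff_metadata is a hypothetical helper. For values that actually live in a tracked params file, the dvc params diff command covers this without any custom code.

```python
def diff_metadata(old, new):
    """Map each changed key to an (old value, new value) pair."""
    changes = {}
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = (before, after)  # None marks a missing key
    return changes


# Hypothetical metadata for two versions of the same dataset.
v1 = {"annotated_by": "alice", "tool": "tool-a", "num_documents": 100}
v2 = {"annotated_by": "alice", "tool": "tool-b", "num_documents": 150, "reviewed": True}

for key, (before, after) in diff_metadata(v1, v2).items():
    print(f"{key}: {before} -> {after}")
```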

What are the current situations when DVC overwrites the .dvc file (thus removing the meta field)?

Running “dvc add/import/import-url” on that same dataset will overwrite it, i.e. whenever those commands create a “.dvc” file of the same name.
Look at the end of this section for more information.
cc @fra-csl
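
One possible way to work around the overwrite is to save the meta block before re-running the command and re-attach it afterwards. The Python sketch below only manipulates the file text and assumes that meta is a top-level key whose values are all indented; split_meta and restore_meta are hypothetical helpers, not DVC functions, and the "after" text stands in for whatever the command regenerates.

```python
def split_meta(dvc_text):
    """Split .dvc text into (text without meta, meta block text)."""
    kept, meta, in_meta = [], [], False
    for line in dvc_text.splitlines(keepends=True):
        if line.startswith("meta:"):
            in_meta = True
        elif in_meta and not (line.startswith(" ") or line.strip() == ""):
            in_meta = False  # a new top-level key ends the meta block
        (meta if in_meta else kept).append(line)
    return "".join(kept), "".join(meta)


def restore_meta(new_dvc_text, saved_meta):
    """Re-attach a saved meta block to a regenerated .dvc file."""
    body, _ = split_meta(new_dvc_text)  # drop any meta the new file has
    if body and not body.endswith("\n"):
        body += "\n"
    return body + saved_meta


before = 'outs:\n- md5: old\n  path: data\nmeta:\n  annotated_by: "@someone"\n'
_, saved = split_meta(before)

# Stand-in for the regenerated file content (assumed, not real DVC output).
after_add = "outs:\n- md5: new\n  path: data\n"
print(restore_meta(after_add, saved))
```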


@fra-csl, this is a very good use case.

Meta is a good workaround that can potentially help you. But it might be worth investigating this use case more deeply, and I’d appreciate it if you could help us. It feels like you need additional functionality. Also, I have a feeling that some other scenarios could benefit from this feature.

We might have more requirements, like not overwriting the labels, or inheriting some types of metadata in ML models or processed datasets that were based on these signals.

Some data platforms support a similar concept as data labels and provide data lineage for inheriting the labels properly. For example, if an ML model was based on a dataset with labels user-personal-information or confidential then these labels have to be assigned to the model.

We should probably think about introducing data labeling in DVC with a set of rules and commands to support it properly.

@fra-csl what would be your requirement for this kind of feature? As far as I can see:

  1. DVC should not change the labels when the dataset changes (dvc add)
  2. Label inheritance is not needed for labels like who annotated it, when the annotation took place, using which tool
  3. For some types of labels the inheritance is required.
  4. The labels can be in *.dvc files as well as dvc.yaml
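
To make requirements 2 and 3 concrete, here is a purely hypothetical Python sketch of label inheritance. Nothing in it is an existing DVC feature: the Label class, the inherited flag, and inherit_labels are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    name: str
    inherited: bool  # True if derived artifacts must carry this label


def inherit_labels(*input_label_sets):
    """Return the labels a derived artifact receives from its inputs."""
    result = set()
    for labels in input_label_sets:
        result |= {label for label in labels if label.inherited}
    return result


# A dataset with one inheritable label and one that stays with the dataset.
dataset_labels = {
    Label("confidential", inherited=True),
    Label("annotated_by:@someone", inherited=False),
}

# A model trained on the dataset picks up only the inheritable labels.
model_labels = inherit_labels(dataset_labels)
print(sorted(label.name for label in model_labels))
```

Here a model built on the dataset would carry confidential but not the annotator label, which matches the distinction drawn in points 2 and 3.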