I am using DVC to track different versions of a dataset consisting of annotations on an NER dataset. In particular, I would like to keep track of some dataset metadata which may change from one version to another, and I would also like to quickly compare different versions of the dataset through the evolution of this metadata.
The current solution I am adopting is to track the information on the dataset in a README.md, but I would like the link between the metadata and the dataset file to be more explicit than that.
How can I do that? I thought about adding this metadata to the dataset.dvc file that is automatically generated, but I am not sure that this is a good idea: would a future dvc add overwrite or delete it, for instance?
If you write this metadata in some automated way, maybe you could consider creating {output-name}-metadata alongside the output and treat it as normal DVC output?
Best,
Paweł
Unfortunately, this metadata is not an output of the ML pipeline. On the contrary, it represents various kinds of information we collect on the dataset itself (e.g. who annotated it, when the annotation took place, using which tool, …).
It is possible to use comments and the meta key to store user-defined data inside the dvc.yaml file.
I thought that dvc.yaml was used to represent the DAG of an ML pipeline, but here I am talking about data that does not necessarily interact with any ML pipeline… Does it make sense to put this meta key in the my_dataset.dvc file instead? Or should I do something different?
@fra-csl
If it's something (for example) too big for a *.dvc/*.yaml file, and you are afraid it will make the pipeline file hard to read, I would consider
keeping your md files approach, but using meta to store the path to your md files,
so that:
you have the data file
you create data-metadata
in data.dvc you add
meta:
  metadata-path: data-metadata
That way you keep using your current approach, but also add a link between the data and its metadata.
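As a rough illustration of how that link could be followed from a script, here is a minimal Python sketch. It assumes the meta block is a top-level mapping with indented children, as in the snippet above, and that the stored path is relative to the directory holding the .dvc file; both are conventions chosen for this example, not something DVC enforces. The function names and the md5 value are hypothetical. To stay dependency-free it scans the text line by line; a real YAML parser would be more robust.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical contents of data.dvc; the md5 value is a placeholder.
EXAMPLE_DVC = """\
outs:
- md5: 0123456789abcdef0123456789abcdef
  path: data
meta:
  metadata-path: data-metadata
"""


def find_metadata_path(dvc_text: str, key: str = "metadata-path") -> Optional[str]:
    """Return the value stored under `meta: <key>` in a .dvc file's text, or None."""
    in_meta = False
    for line in dvc_text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        if not line[0].isspace():
            # An unindented line starts a new top-level entry;
            # we are inside the meta block only if that entry is `meta:`.
            in_meta = line.rstrip() == "meta:"
            continue
        if in_meta:
            name, sep, value = line.strip().partition(":")
            if sep and name == key:
                return value.strip().strip("'\"") or None
    return None


def resolve_metadata_file(dvc_file: str, dvc_text: str) -> Optional[PurePosixPath]:
    """Resolve the metadata path relative to the .dvc file's directory (assumed convention)."""
    rel = find_metadata_path(dvc_text)
    if rel is None:
        return None
    return PurePosixPath(dvc_file).parent / rel


if __name__ == "__main__":
    print(resolve_metadata_file("datasets/data.dvc", EXAMPLE_DVC))
    # prints: datasets/data-metadata
```

Because DVC ignores the contents of meta, a helper like this has to live in your own tooling.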
Another option (a bit counterintuitive) is to use our params.yaml file. DVC is not opinionated about what you put there, and it already has built-in capabilities to show you a diff, a table with values from this file, etc. I would explore this option.
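To make concrete the kind of version-to-version comparison the question asks for, here is a small Python sketch that is independent of DVC. It diffs two flat metadata mappings, such as the values one might keep in params.yaml for two dataset versions. The field names (annotator, tool, annotated_on, guidelines) and their values are hypothetical.

```python
from typing import Any, Dict, Tuple


def diff_metadata(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each changed key to its (old, new) values; a missing key is reported as None."""
    changes: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


# Hypothetical metadata for two versions of the same dataset.
v1 = {"annotator": "alice", "tool": "toolA", "annotated_on": "2021-01-10"}
v2 = {"annotator": "bob", "tool": "toolA", "annotated_on": "2021-03-02", "guidelines": "v2"}

if __name__ == "__main__":
    for key, (before, after) in diff_metadata(v1, v2).items():
        print(f"{key}: {before!r} -> {after!r}")
```

Note that this sketch does not distinguish a key that is absent from one that is explicitly set to None; for descriptive fields like these, that is usually acceptable.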
What are the current situations when DVC overwrites the .dvc file (thus removing the meta field)?
Running “dvc add/import/import-url” on that same dataset will overwrite it, i.e. whenever those commands create a “.dvc” file of the same name. Look at the end of this section for more information.
cc @fra-csl
Meta is a good workaround that can potentially help you. But it might be worth investigating this use case more deeply, and I’d appreciate it if you could help us. It feels like you need additional functionality. Also, I have a feeling that some other scenarios could benefit from this feature.
We might have more requirements, like not overwriting the labels, or inheriting some types of metadata in ML models or processed datasets that were based on these signals.
Some data platforms support a similar concept as data labels and provide data lineage for inheriting the labels properly. For example, if an ML model was based on a dataset with labels user-personal-information or confidential then these labels have to be assigned to the model.
We should probably think about introducing data labeling in DVC with a set of rules and commands to support it properly.
@fra-csl what would your requirements be for this kind of feature? As far as I can see:
DVC should not change the labels when the dataset changes (dvc add)
Label inheritance is not needed for labels like who annotated it, when the annotation took place, using which tool
For some types of labels the inheritance is required.
The labels can be in *.dvc files as well as dvc.yaml
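To make the inheritance requirement more tangible, here is a purely illustrative Python sketch of how labels might propagate along data lineage. None of this exists in DVC; the function name, the dependency mapping, and the label names are all hypothetical. Labels marked as inheritable (such as confidential) flow from upstream artifacts to everything derived from them, while descriptive labels (such as who annotated the data) stay on the artifact they were attached to.

```python
from typing import Dict, List, Set


def effective_labels(
    artifact: str,
    deps: Dict[str, List[str]],
    own_labels: Dict[str, Set[str]],
    inheritable: Set[str],
) -> Set[str]:
    """Return the artifact's own labels plus inheritable labels from all upstream artifacts."""
    result = set(own_labels.get(artifact, set()))
    seen: Set[str] = set()
    stack = list(deps.get(artifact, []))
    while stack:
        parent = stack.pop()
        if parent in seen:
            continue  # already visited; also guards against cycles
        seen.add(parent)
        # Only labels declared inheritable cross from parent to child.
        result |= own_labels.get(parent, set()) & inheritable
        stack.extend(deps.get(parent, []))
    return result


# Hypothetical lineage: model <- features <- dataset
deps = {"model": ["features"], "features": ["dataset"]}
own_labels = {
    "dataset": {"confidential", "annotated-by:alice"},
    "model": {"owner:ml-team"},
}
inheritable = {"confidential", "user-personal-information"}

if __name__ == "__main__":
    print(sorted(effective_labels("model", deps, own_labels, inheritable)))
    # prints: ['confidential', 'owner:ml-team']
```

Here the model picks up confidential from the dataset two steps upstream, but not annotated-by:alice, which matches the distinction drawn in the list above.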