Fill-back metrics

Say I am maintaining some data using DVC, and at some point I decide I want a metric showing some data statistic (e.g. how many positive samples I have). Can I back-fill this metric to previous commits? (The goal is to track the number of positives I had after each data commit.)

If the commit history is A->B->C, I know I can go back to any commit and run a pipeline, but the metric output of this pipeline will need to be saved in a different commit, right? So I will have to create new Git commits (A->A’, B->B’, C->C’) that will store the metric results, or is there a different way?


Very interesting question, @jonilaserson!

I think it’s fundamentally a Git question, as versioning is done entirely with Git. (DVC does the tracking of data via placeholders in small dvc.lock and .dvc files.)

So I will have to create new git commits: A->A’ B->B’, C->C’ that will store the metric results, or is there a different way?

Correct. This can be achieved via rebase, I believe. But that means rewriting the commit history, which is sometimes frowned upon.

Please feel free to open a feature request for DVC to support this use case directly though! https://github.com/iterative/dvc/issues/new/choose


I actually meant to add the three branches like this:
A->A’
|
v
B->B’
|
v
C->C’
And then I won’t need rebase, but I will still need a way to know I should collect the metrics in these “detached” commits.

So C’ is where you add the metrics to C, and then you cherry-pick that commit separately onto B and onto A, creating your 2 extra branches.
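A minimal sketch of that cherry-pick workflow, using plain Git in a throwaway repository. The file `labels.txt`, the script `count_positives.sh`, the output `positives.txt`, and the `metrics-*` branch names are all made up for illustration. In a real DVC project you would presumably run `dvc checkout` and `dvc repro` on each branch instead of calling the script directly.

```shell
#!/bin/sh
set -eu

# Throwaway repository with three data commits A -> B -> C (illustrative only).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

printf '1\n0\n' > labels.txt
git add labels.txt && git commit -q -m "A" && git tag A
printf '1\n' >> labels.txt
git commit -q -am "B" && git tag B
printf '1\n0\n' >> labels.txt
git commit -q -am "C" && git tag C

# C': a side branch off C that adds the (hypothetical) metric script.
git checkout -q -b metrics-C C
printf '#!/bin/sh\ngrep -c "^1$" labels.txt > positives.txt || true\n' > count_positives.sh
chmod +x count_positives.sh
git add count_positives.sh
git commit -q -m "Add positives metric script"
metric_commit=$(git rev-parse HEAD)

# Compute and commit the metric for C itself.
./count_positives.sh
git add positives.txt
git commit -q -m "Metric for C"

# A' and B': branch off each older commit, cherry-pick the script commit,
# then recompute the metric against that commit's data.
for rev in A B; do
  git checkout -q -b "metrics-$rev" "$rev"
  git cherry-pick "$metric_commit" >/dev/null
  ./count_positives.sh
  git add positives.txt
  git commit -q -m "Metric for $rev"
done

# Show the back-filled metric per original commit.
for rev in A B C; do
  printf '%s: %s\n' "$rev" "$(git show "metrics-$rev:positives.txt")"
done
```

The original history A->B->C is left untouched; only the three side branches are new.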

still need a way to know I should collect the metrics in these “detached” commits

If the metrics are data series (plots), you can use dvc plots diff A' B' C'. See https://dvc.org/doc/command-reference/plots/diff. But this feature isn’t available for plain metrics yet. I just opened an issue for that.
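For plain metrics, one possible workaround is a branch naming convention plus plain Git. This is my own assumption, not a DVC feature: the sketch below supposes every side branch is named `metrics-<something>` and stores its value in a hypothetical `positives.txt`.

```shell
#!/bin/sh
set -eu

# Throwaway repo in which each metrics-* branch holds a positives.txt
# (branch and file names are hypothetical).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"
git commit -q --allow-empty -m "root"

for pair in A:1 B:2 C:3; do
  rev=${pair%%:*}
  count=${pair##*:}
  git checkout -q -b "metrics-$rev"
  echo "$count" > positives.txt
  git add positives.txt
  git commit -q -m "Metric for $rev"
done

# Collect: list every branch matching the convention and read its metric
# straight from the branch tip, without checking anything out.
report=$(
  git for-each-ref --format='%(refname:short)' 'refs/heads/metrics-*' |
  while read -r branch; do
    printf '%s %s\n' "$branch" "$(git show "$branch:positives.txt")"
  done
)
echo "$report"
```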