Fill-back metrics

Say I am maintaining some data using DVC, and at some point I decide I want a metric showing some data statistics (e.g. how many positive samples I have). Can I back-fill this metric to previous commits? (The goal is to track the number of positives I had after each data commit.)

If the commit history is A->B->C, I know I can go back to any commit and run a pipeline, but the metric output of this pipeline will need to be saved in a different commit, right? So I will have to create new git commits (A->A’, B->B’, C->C’) that will store the metric results, or is there a different way?
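For concreteness, the kind of metric in question could be produced by a small script like the one below. This is only a sketch: the CSV layout, the `label` column, the positive value `"1"`, and the `metrics.json` / `num_positives` names are all assumptions made for illustration, not anything DVC prescribes beyond the metrics file being JSON, YAML, or TOML.

```python
import csv
import json
from pathlib import Path


def count_positives(csv_path, label_column="label", positive_value="1"):
    """Count rows whose label column equals the positive value.

    The column name and positive value are hypothetical defaults.
    """
    with open(csv_path, newline="") as f:
        return sum(
            1 for row in csv.DictReader(f) if row[label_column] == positive_value
        )


def write_metric(csv_path, metrics_path="metrics.json"):
    """Write the count to a JSON file that a DVC stage could declare as a metric."""
    metrics = {"num_positives": count_positives(csv_path)}
    Path(metrics_path).write_text(json.dumps(metrics, indent=2) + "\n")
    return metrics
```

Running this as a pipeline stage at each data commit is what produces the per-commit value that then has to be stored somewhere.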


Very interesting question, @jonilaserson!

I think it’s fundamentally a Git question, as versioning is done entirely with Git. (DVC does the tracking of data via placeholders in small dvc.lock and .dvc files.)

So I will have to create new git commits (A->A’, B->B’, C->C’) that will store the metric results, or is there a different way?

Correct. This can be achieved via rebase, I believe, but that may imply rewriting the commit history, which is sometimes frowned upon.

Please feel free to open a feature request on GitHub for DVC to support this use case directly though!


I actually meant to add the three branches like this:
A->A’
|
v
B->B’
|
v
C->C’
And then I won’t need rebase, but I will still need a way to know I should collect the metrics in these “detached” commits.

So C’ is where you add the metrics to C, and then you cherry-pick that commit separately onto B and onto A, creating your two extra branches.
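The cherry-pick approach could be scripted roughly as follows. This is a sketch under several assumptions: the cherry-picked commit contains only the stage definition and script (not the computed results, which would otherwise conflict or carry over C's values), the metric lands in a hypothetical `metrics.json` that is committed to Git, and the data for each commit is available in the DVC cache (otherwise a `dvc pull` would be needed first). The `metrics/` branch prefix is also made up for illustration.

```python
import subprocess


def backfill_plan(metric_commit, data_commits, branch_prefix="metrics/"):
    """Return the git/dvc commands that create one metrics branch per data commit."""
    plan = []
    for commit in data_commits:
        plan += [
            # Start a side branch at the historical data commit.
            ["git", "checkout", "-b", f"{branch_prefix}{commit}", commit],
            # Bring in the commit that adds the metric stage.
            ["git", "cherry-pick", metric_commit],
            # Sync the workspace data with this commit's .dvc / dvc.lock files.
            ["dvc", "checkout"],
            # Recompute the metric for this version of the data.
            ["dvc", "repro"],
            # Record the results (file names are assumptions).
            ["git", "add", "dvc.lock", "metrics.json"],
            ["git", "commit", "-m", f"Back-fill metrics for {commit}"],
        ]
    return plan


def run_plan(plan):
    """Execute the commands in order, stopping at the first failure."""
    for cmd in plan:
        subprocess.run(cmd, check=True)
```

Building the plan separately from running it makes it easy to print and inspect the commands before touching the repository, e.g. `for cmd in backfill_plan("<C-prime-sha>", ["A", "B"]): print(" ".join(cmd))`.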

still need a way to know I should collect the metrics in these “detached” commits

If the metrics are data series (plots), you can use dvc plots diff A' B' C' (see the plots diff command reference). This feature isn’t available for plain metrics yet, unfortunately; I just opened an issue for that.
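In the meantime, plain metrics could be collected across the side branches by reading the metrics file straight out of Git. The sketch below assumes the metric lives in a JSON file (hypothetically `metrics.json`, with a `num_positives` key) that is committed to Git rather than stored only in the DVC cache; the revision names are whatever branches or commits hold the back-filled results.

```python
import json
import subprocess


def read_metrics_at(rev, path="metrics.json"):
    """Return the parsed metrics file as stored in Git at the given revision."""
    text = subprocess.run(
        ["git", "show", f"{rev}:{path}"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(text)


def metrics_table(metrics_by_rev, key="num_positives"):
    """Build (revision, value) rows; the value is None where the key is missing."""
    return [(rev, metrics.get(key)) for rev, metrics in metrics_by_rev.items()]
```

Usage would look something like `metrics_table({rev: read_metrics_at(rev) for rev in ["metrics/A", "metrics/B", "metrics/C"]})`, which gives one row per data commit.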