Say I am maintaining some data using dvc, and at some point I decide I want a metric showing some data statistics (e.g. how many positive samples I have). Can I back-fill this metric to previous commits? (The goal is to track the number of positives I had after each data commit.)
If the commit history is A->B->C, I know I can go back to any commit and run a pipeline, but the metric output of this pipeline will need to be saved in a different commit, right? So I will have to create new git commits (A->A’, B->B’, C->C’) that will store the metric results, or is there a different way?
Very interesting question @jonilaserson !
I think it’s fundamentally a Git question, as versioning is done entirely with Git. (DVC does the tracking of data via placeholders in small dvc.lock and .dvc files.)
So I will have to create new git commits (A->A’, B->B’, C->C’) that will store the metric results, or is there a different way?
Correct. I believe this can be achieved with a rebase, but that means rewriting the commit history, which is sometimes frowned upon.
Please feel free to open a feature request for DVC to support this use case directly though! https://github.com/iterative/dvc/issues/new/choose
I actually meant to add the three branches like this:
And then I won’t need rebase, but I will still need a way to know I should collect the metrics in these “detached” commits.
So C’ is where you add the metrics on top of C, and then you cherry-pick that commit separately onto B and onto A, creating your 2 extra branches.
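To make the cherry-pick idea concrete, here is a rough sketch that only builds the list of commands for each older commit, without running anything. The function name backfill_plan, the backfill/ branch prefix and the commit message are made up for illustration. It also assumes the commit C’ adds the metrics stage to dvc.yaml and that the metrics file is tracked by Git; the cherry-pick may conflict (for example in dvc.lock) and need manual resolution.

```python
import shlex


def backfill_plan(metrics_commit, older_commits, branch_prefix="backfill/"):
    """Build the commands that back-fill metrics, one branch per older commit.

    All names here (function, branch prefix, commit message) are hypothetical.
    Nothing is executed; the caller decides whether to run or just print them.
    """
    plan = []
    for rev in older_commits:
        plan += [
            # Start a new branch at the old data commit (A or B).
            ["git", "checkout", "-b", f"{branch_prefix}{rev}", rev],
            # Bring in the commit that introduced the metrics stage (C').
            ["git", "cherry-pick", metrics_commit],
            # Sync the workspace data with this revision's DVC files.
            ["dvc", "checkout"],
            # Recompute the metrics against this revision's data.
            ["dvc", "repro"],
            # Record the updated dvc.lock and metrics file.
            ["git", "commit", "-am", f"Backfill metrics for {rev}"],
        ]
    return plan


# Print the plan for review instead of running it.
for command in backfill_plan("C-prime", ["A", "B"]):
    print(shlex.join(command))
```

Printing the plan first makes it easy to check the sequence before touching the repository. If the data did not change between two commits, the final commit step will have nothing to record.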
still need a way to know I should collect the metrics in these “detached” commits
If the metrics are data series (plots), you can use dvc plots diff A' B' C' (see https://dvc.org/doc/command-reference/plots/diff). But this feature isn’t available for plain metrics yet; I just opened an issue for that.
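For plain metrics, one possible workaround is to read the metrics file straight out of Git at each revision and put the values side by side. This is only a sketch: it assumes the metrics live in a Git-tracked JSON file called metrics.json with a top-level key per metric, and the helper names read_at_revision and collect_metric are invented for this example.

```python
import json
import subprocess


def read_at_revision(rev, path):
    """Return the content of `path` as stored in Git at revision `rev`."""
    result = subprocess.run(
        ["git", "show", f"{rev}:{path}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def collect_metric(revs, key, path="metrics.json", reader=read_at_revision):
    """Map each revision to the value of one metric key (None if missing).

    `path` and `key` are assumptions about how the metrics file is laid out.
    `reader` can be swapped out, e.g. for testing without a repository.
    """
    return {rev: json.loads(reader(rev, path)).get(key) for rev in revs}


# Example call inside a repository (branch names are hypothetical):
# collect_metric(["backfill/A", "backfill/B", "main"], "positives")
```

Since the three commits sit on separate branches, passing the branch names (or commit hashes) as revs is enough; no checkout is needed because git show reads directly from the object store.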