Versioning predictions

Hi, I am considering using dvc not only for versioning the training, but also the predictions. Since the project I am working on handles clinical data, it is required to track how each prediction has been produced.

The aim is, given incoming batches of inputs, to version each run of the pipeline, generating a sort of provenance store.
The training pipeline and the one for predictions are separated.

I am considering the following workflow:

  • using the multi-stage (foreach) feature, dvc.yaml is defined like the following one (for simplicity only one stage is shown)

    stages:
      stage1:
        foreach: ${files} # From params.yaml
        do:
          cmd: predict-stage1 inputs/${item} outputs-stage1/${item}
          deps:
          - inputs/${item}
          outs:
          - outputs-stage1/${item}

  • Every time a new batch of files is ready, params.yaml is populated with their names and then dvc repro is launched. I am not sure whether to keep dvc.lock or delete it before running the pipeline, so that it refers only to the current inputs.
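For example, params.yaml for one batch would simply list the items consumed by the foreach above (the file names here are made up):

files:
- case-001
- case-002
- case-003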

Do you see any evident drawbacks in using dvc this way? Do you know of anyone using dvc for a use case similar to this one?

Thanks

Hi @mauro

Thanks for sharing your use case, especially since you are interested in “multistages” or “foreach stages” which is a new feature and we’re very interested in understanding how users will adopt it.

It seems to me that there is no connection between the stages in your dvc.yaml, meaning that while there’s of course a relationship between files in inputs/ and outputs-stage*/ (for a single stage), there’s no relationship between the outputs of one stage and the inputs of another. Can you confirm?

If that is so, does it matter in which order the stages are actually run? Or is it OK if they’re considered parallel (independent from each other)?

Great question. The short answer is that you don’t need to delete it, as only the stages defined in dvc.yaml are considered by DVC regardless of what’s in dvc.lock. But I can see that in your case there would be lots of unnecessary clutter in there. I’d delete it each time for now (if you do end up using the proposed workflow)… I’ll check with the team about that behavior though.

Only what I suggested earlier, namely that the stages don’t seem to be connected to one another. DVC is best used with pipelines formed by stages connected by their outputs and dependencies (see dag).
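For illustration, in a connected pipeline the output of one stage appears as a dependency of the next, which is how DVC builds the DAG (a minimal sketch with made-up script and directory names):

stages:
  prepare:
    cmd: ./prepare.sh raw prepared
    deps:
      - raw
    outs:
      - prepared
  predict:
    cmd: ./predict.sh prepared predictions
    deps:
      - prepared # output of the prepare stage, so DVC runs predict after it
    outs:
      - predictions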

If you basically have one stage at the moment (cmd: predict-stage1), have you considered using a simple stage and expanding the input/output files via wildcards and/or xargs (Linux) in the command itself? Something like:

stages:
  stage1:
    cmd: ls -1 inputs-stage1 | xargs -I {} predict-stage1 inputs-stage1/{} outputs-stage1/
    deps:
      - inputs-stage1
    outs:
      - outputs-stage1

Caveats:

  • predict-stage1 only receives a directory name as its 2nd argument.
  • The stage has the entire inputs-stage1 dir as a dependency, so if any one file changes, repro runs the whole batch again.
  • The stage has the entire outputs-stage1 dir as output.

Hi @jorgeorpinel, thanks for the reply.

Sorry, for simplicity’s sake I didn’t explain that point well: no, the stages are linked; the outputs of one stage are dependencies for the following one(s).

Ok, thanks.

Actually, the script run in cmd accepts either a single file or a directory, so I could easily switch to a pipeline whose deps and outs are directories. However, I have the requirement to track how each single file is produced, and running cmd with files as inputs (using the foreach loop) produces a dvc.lock that records exactly that information. Or is there another way, given a file and a pipeline that processes directories, to know which inputs (in terms of files, not directories) and stages produced it?

OK. Can you please elaborate on which stages are connected? Are we talking about the stages inside the foreach: being connected to one another (didn’t seem so from the directory names in the given example)? Or each batch of stages depending on the outputs of previous batches?

In the latter case (which I think is the case) my comment about the stages in each foreach being independent remains, and it would be great to know whether the execution order matters to you — because atm DVC doesn’t guarantee any order in those cases (they’re considered “parallel”).

So dvc.lock doesn’t prune stages automatically, in order to avoid frequent merge conflicts in Git. This assumes that users won’t typically need to rename stages in a pipeline (other than for testing/playing around, but in that case you can just delete dvc.lock regularly). So far I think this assumption is still valid, but I’m interested to get your feedback!

True, you did mention about data provenance already. In that sense it seems like a pretty helpful way to use foreach stages.

But I’m still confused as to why the stage names need to change in every batch: if the stages in a batch will be connected to the following ones, then you’d want to keep them, and add a new foreach stage to dvc.yaml for each batch (possible drawback: dvc.lock may grow too much). If they’re disconnected, you may want to organize them in different directories (a dvc.yaml file per dir) in a repo, or even in separate repos, instead of as Git commits of the same branch (I’m assuming that’s how they’re saved now).
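Here is a rough sketch of that first option, assuming each batch keeps its file list under its own key in params.yaml (the stage names and the batch1_files/batch2_files keys are made up):

stages:
  stage1-batch1:
    foreach: ${batch1_files} # list kept from the first batch
    do:
      cmd: predict-stage1 inputs/${item} outputs-stage1/${item}
      deps:
      - inputs/${item}
      outs:
      - outputs-stage1/${item}
  stage1-batch2:
    foreach: ${batch2_files} # appended when the second batch arrives
    do:
      cmd: predict-stage1 inputs/${item} outputs-stage1/${item}
      deps:
      - inputs/${item}
      outs:
      - outputs-stage1/${item}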

I do see how easy it is to rename them by loading params in foreach: ${files} though… It may be an unintended use of our current implementation. I need to check with the team again while we hear back from you :slightly_smiling_face:

I think it is time for a better example :slight_smile: Here is part of my pipeline:

stages:
  extract-tissue:
    foreach: ${files} # From params.yaml
    do:
      cmd: extract-tissue.sh ${item} output # output is a dir
      deps:
      - ${item}
      outs:
      - output/${item}/tissue

  predict-tumor:
    foreach: ${files} # From params.yaml
    do:
      cmd: predict-tumor.sh ${item} output/${item}/tissue output
      deps:
      - ${item}
      - output/${item}/tissue
      outs:
      - output/${item}/tumor

So basically the stages inside a foreach are not connected to each other, while connections between different foreach blocks do exist (predict-tumor is based on the tissue previously extracted).

The file layout, where outputs are grouped under directories named output/${item}, depends on the fact that we are saving files as a zarr group of arrays (so under output/${item} there are all the predictions for a given input), and we would prefer not to drop the current implementation if the drawbacks are manageable.

I don’t fully understand here; I hope the previous example helps.

I expect to process tens of files at once, and to have fewer than ten foreach stage blocks (extract-tissue and predict-tumor plus some others), so dvc.lock will probably be too large for a human to handle but still machine readable, especially if I delete it for every new batch of inputs and use git for the history.

I am probably using dvc in an unintended way, but it seems it has all the stuff I need (plus a lot more!).

Thanks for the example! So yes, you’re using foreach stages as a way to define a pipeline that processes batches of files “in parallel” (conceptually). I’m not sure it’s the best approach… Let’s see.

Sorry. What I meant is that the ${files} given to foreach translate into stage names in dvc.lock, in your case something like extract-tissue@file1, extract-tissue@file2, etc. So when you overwrite the n files in params.yaml (next batch), dvc repro appends another n stages to dvc.lock. If you go that way, why not keep the old dvc.lock entries as a history that’s always visible?
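For reference, a rough sketch of how one of those entries looks in dvc.lock (hashes replaced by placeholders; the exact fields and layout depend on the DVC version):

extract-tissue@file1:
  cmd: extract-tissue.sh file1 output
  deps:
  - path: file1
    md5: <hash of file1>
  outs:
  - path: output/file1/tissue
    md5: <hash of the tissue output>
# ...and a similar entry is appended for every item, plus the predict-tumor@<item> ones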

As an alternative, though, you could make a directory (or repo) per batch instead, and put in it the input files, a params.yaml listing them (but at this point, why not read the dir listing from your code?), and a copy of that dvc.yaml. The resulting dvc.lock files in these dirs would be similar but have completely different stage names. This way you also have visibility of all your batches in the project file structure (instead of having to navigate Git commits).

But even that is a workaround TBH. The ideal scenario would be to avoid using foreach: as a data batch/partitioning tool, but I definitely see how it helps realize your goal of tracking per-file provenance. It’s a good Q! I’ll probably have to open an issue in our GH repo and discuss it with the team. I’ll keep you posted…

Another possible confusion I see has to do with project structure. Why use commits to organize batches, when these don’t (seem to) represent a project version? In that sense too, it makes more sense to split the batches into dirs or in some other way. Git could still be useful as a project backup utility, and especially when the code changes (at which point you may want to rerun all the batches again), since that sounds more like a new project version.

For now, I found a similar case in Using DVC to keep track of multiple model variants (which you may find interesting), where the foreach functionality was originally conceived. So I must say your use seems to be aligned with the original intention of the feature after all. Doing some more research… :hourglass:

UPDATE: I started dvc.yaml: foreach stages use case? · Issue #5440 · iterative/dvc · GitHub to discuss this for now. Thanks