Composing params.yaml and dvc.yaml to manage multiple independent pipeline variants

Hi all,

I’m managing a DVC project where the same set of input data is used to produce multiple distinct outputs. Specifically, we train multiple models, each targeting a different output variable. In addition to these different targets, we also want to produce variants of each output depending on things like data resolution, input subsets, and statistical summary types (e.g., mean or max).

Currently, we handle this by creating a separate Git branch for each combination of these “treatment” settings, named <source>_<subset>_<resolution>_<statistic> (e.g., planet_africa_1km_mean), each with its own params.yaml and dvc.lock. All code changes are made in main, and then the treatment branches are rebased onto main to stay up to date. This works, but managing dozens of branches has become difficult and error-prone, especially for onboarding new contributors.

We’d like to restructure the project so we can:

  • Avoid separate Git branches per configuration
  • Keep treatment outputs isolated (i.e., avoid overwriting outputs or having DVC try to recompute irrelevant ones)
  • Run the pipeline for a specific treatment/configuration in isolation (i.e. don’t trigger checks for all other stages/outputs)
  • Treat each output as a legitimate final product (not just an intermediate experiment)

We’ve considered dvc exp run, but experiments seem more focused on comparing runs or selecting a “best” model, which isn’t our goal here.

Is there a way to structure this using directories, parameter files, or stages to keep treatments separate without branching? Would you recommend a mono-repo with subdirectories per config, or some sort of templating approach?

Ideally we would like to avoid having to rewrite the same stages and params across many different dvc.yamls and params.yamls if possible, and be able to simply add novel or non-default stages and params in the treatment-specific files, if that makes sense. The Hydra implementation seemed promising, but it also appears to be primarily tied to the experiments feature.

Thanks for any suggestions!

@lusk have you by chance looked at the matrix and foreach features of dvc.yaml? They are an alternative way to parametrize a stage and produce many outputs at once. You should also be able to produce a single output by specifying a concrete stage name (the names are autogenerated; you can see them in dvc.lock and/or with dvc dag).
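
To make that concrete, here is a rough sketch of what a matrix stage could look like for the treatments described above. The script path (src/train.py), its command-line flags, and the data/ and outputs/ layout are all made up for illustration; only the treatment values (planet, africa, 1km, mean, max) come from the original post. As far as I know, matrix needs a reasonably recent DVC 3.x release.

```yaml
# dvc.yaml (sketch only: src/train.py, its flags, and the data/ and
# outputs/ paths are hypothetical placeholders, not from the real project)
stages:
  train:
    # One stage is generated per combination of the values below.
    matrix:
      source: [planet]
      subset: [africa]
      resolution: [1km]        # add further resolutions here
      statistic: [mean, max]
    cmd: >-
      python src/train.py
      --source ${item.source}
      --subset ${item.subset}
      --resolution ${item.resolution}
      --statistic ${item.statistic}
      --out outputs/${item.source}_${item.subset}_${item.resolution}_${item.statistic}
    deps:
      - src/train.py
      - data/${item.source}
    outs:
      # Each combination writes to its own directory, so treatments
      # do not overwrite each other.
      - outputs/${item.source}_${item.subset}_${item.resolution}_${item.statistic}
```

With a definition like this, the generated stage names should look like train@planet-africa-1km-mean, so a single treatment can be reproduced with dvc repro train@planet-africa-1km-mean without running the other combinations. All treatments then live on one branch and share one dvc.lock.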

Let me know what you think.