Pipeline with many parallel stages

maartenb · September 23, 2021, 12:57pm

I would like to use DVC for a project that’s not ML but a sensitivity analysis for a bunch of parameters of a simulation model.

The DAG of my pipeline looks basically like the diagram below. There’s a single “command” file (box at the top) in which all simulations are defined (a CSV file with one row for each simulation). Each column in the 3x4 blocks represents the set of steps for one simulation (preprocessing, running, postprocessing). Finally, combined results are created from groups of simulations (the 3x4 blocks).

I want to make sure that, if a change is made to the command file for one simulation, only the stuff for that simulation is redone. It would be OK if the first step (the arrows from the command file to the first row) is always run for all simulations (this is a minor step). But the following steps should be done only for the relevant simulations.
I could create a stage for every simulation and every step but this is a hassle (in reality there are hundreds of simulations). Also I want to make it easy to add more simulations, by adding more lines to the command CSV file.

Is there functionality in DVC to handle this kind of situation? Is there a way to restructure my project to make things easier?

isidentical · September 23, 2021, 3:01pm

I want to make sure that, if a change is made to the command file for one simulation, only the stuff for that simulation is redone.

The way DVC pipelines work is that, if any dependency of a stage you are targeting has changed in the workspace, we start from that stage and reproduce every stage connected to it through dependency/output relations. You can create a dependency on the command file, which would re-run that stage (and all downstream stages, if there are any) automatically.

I could create a stage for every simulation and every step but this is a hassle (in reality there are hundreds of simulations). Also I want to make it easy to add more simulations, by adding more lines to the command CSV file.

There are 2 options that I can see:
a) Use foreach loops: they are relatively limited, so they might not support your use case, but if you want to repeat a stage with the same shape over different commands, they should work.
b) Generate your own DVC files: you can write a script that takes all the arguments you need and outputs a dvc.yaml file in the format you want (or use dvc stage add instead of manually generating a YAML template).

They both have their pluses and minuses (especially regarding complexity), but either might be worth a shot.
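
For option (b), a minimal sketch of such a generator script could look like the following. Everything specific here is an assumption made for illustration: the `sim_id` column, the `sims/<id>/...` directory layout, and the `preprocess.py`, `run.py` and `postprocess.py` scripts are hypothetical names, not anything from the original project. The idea is that each CSV row gets its own small input file, so editing one row changes only one dependency. The output is rendered as JSON, which a YAML parser should generally accept; if PyYAML is installed, `yaml.safe_dump` could be swapped in for a more readable file.

```python
import csv
import io
import json

# Hypothetical step names; each is assumed to have a script called <step>.py.
STEPS = ["preprocess", "run", "postprocess"]


def build_pipeline(csv_text, id_column="sim_id"):
    """Return (stages, inputs) for every row of the command CSV.

    stages: dict shaped like the `stages:` section of a dvc.yaml file
    inputs: dict mapping a per-simulation input path to that row's values
    """
    stages, inputs = {}, {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        sim = row[id_column]
        # One input file per simulation, so a changed CSV row touches
        # only the chain of stages belonging to that simulation.
        previous = f"sims/{sim}/input.json"
        inputs[previous] = dict(row)
        for step in STEPS:
            out = f"sims/{sim}/{step}.out"
            stages[f"{step}_{sim}"] = {
                "cmd": f"python {step}.py {previous} {out}",
                "deps": [f"{step}.py", previous],
                "outs": [out],
            }
            # The output of this step is the input of the next one.
            previous = out
    return stages, inputs


def render_dvc_yaml(stages):
    """Serialize the stages as JSON text (assumed to be loadable as YAML)."""
    return json.dumps({"stages": stages}, indent=2)


example_csv = "sim_id,alpha,beta\ns001,0.1,5\ns002,0.2,5\n"
stages, inputs = build_pipeline(example_csv)
print(render_dvc_yaml(stages))
```

In a real script you would also write each entry of `inputs` to disk and save the rendered text as dvc.yaml. Since DVC compares dependencies by content hash, rewriting an unchanged input file should not by itself trigger a re-run.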

dberenbaum · September 23, 2021, 3:39pm

One more thing to consider is what to do with your initial CSV file based on the options mentioned above:
a) If you use foreach loops, you may want to consider restructuring your CSV file into a supported parameters file format: https://dvc.org/doc/user-guide/project-structure/pipelines-files#parameter-dependencies. This would allow you to have individual stages only depend on the specified parameters from that file.
b) If you take the suggested approach of generating your own dvc.yaml files, you can probably read in the CSV file to generate your pipeline without making it a dependency within the pipeline itself.

maartenb · September 27, 2021, 3:30pm

Thanks. After reading up on parameter files and foreach I believe a combination of the two may be what I need.
