Hi,
Right now I have a pipeline in which the first stages create some initial data and train an initial model. After that there is a stage which uses this model to create new samples for which it has high uncertainty. Then a stage runs some reference calculations to get the labels for the new samples. After that, in another stage, the new samples are used to retrain the model. I run these calculations on a SLURM cluster, and the commands of the stages submit jobs with sbatch. While the generation of new samples and the retraining run on GPUs, the calculation of the new labels happens on CPUs. The structure of the pipeline currently looks something like this:
```yaml
stages:
  # < some initial stages >
  uncertainty_optimization:
    cmd: sbatch --wait jobs/uncertainty_optimization.sh  # running on GPU
    ...
  calculate_labels:
    cmd: sbatch --wait jobs/calculate_labels.sh  # running on CPU
    ...
  retrain_model:
    cmd: sbatch --wait jobs/retrain_model.sh  # running on GPU
    ...
```
Now I want to run this iteratively. Are there any suggestions for a good way to achieve this? Thank you very much!
Best regards,
Matthias
Could you please clarify what exactly you are trying to achieve?
I want to run the stages multiple times. It will always be the same code, only the index of the respective iteration is handed over, and the stages will depend on the outputs of the prior iteration. So for the first two iterations it would be something like:
```yaml
stages:
  uncertainty_optimization_0:
    cmd: sbatch --wait jobs/uncertainty_optimization.sh 0  # running on GPU
    ...
  calculate_labels_0:
    cmd: sbatch --wait jobs/calculate_labels.sh 0  # running on CPU
    ...
  retrain_model_0:
    cmd: sbatch --wait jobs/retrain_model.sh 0  # running on GPU
  uncertainty_optimization_1:
    cmd: sbatch --wait jobs/uncertainty_optimization.sh 1  # running on GPU
    ...
  calculate_labels_1:
    cmd: sbatch --wait jobs/calculate_labels.sh 1  # running on CPU
    ...
  retrain_model_1:
    cmd: sbatch --wait jobs/retrain_model.sh 1  # running on GPU
```
So the stages with index 0 will produce outputs, e.g. the retrained model, which will be used in the next iteration. Maybe it is questionable whether it makes sense to have these as separate stages, but each stage has a lot of content and produces outputs which should be tracked. Also, I would like to be able to first run a few iterations and then continue with a few more later. I guess I could just use a script to generate a dvc.yaml with the different iterations as stages. I was just wondering if there is a smarter way to do this which I am missing, or if there are any recommendations/thoughts.
Okay, I see. I think what can help is to use persistent outputs (the `persist: true` flag in the `dvc.yaml` spec) and to pass the iteration as a parameter. Then you would be able to run the pipeline with something like `dvc exp run -S 'iteration=1'`. Or just change the `params.yaml` file yourself and run it again.
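A minimal sketch of what I mean, assuming `iteration` is defined in `params.yaml` and the retrained model is written to a placeholder path `model.pt` (adjust the names to your setup):

```yaml
# params.yaml
iteration: 0
```

```yaml
# dvc.yaml (sketch)
stages:
  retrain_model:
    cmd: sbatch --wait jobs/retrain_model.sh ${iteration}  # running on GPU
    params:
      - iteration          # stage re-runs whenever the iteration value changes
    outs:
      - model.pt:          # placeholder output path
          persist: true    # not deleted before the run, so the script can start from the previous model
```

Since `persist: true` keeps `model.pt` in place between runs, each `dvc exp run -S 'iteration=N'` can pick up the model produced by the previous iteration.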
Also, if you have a predefined number of iterations and it's not too large, I think you could try to use `matrix` or a `foreach` loop in `dvc.yaml`.
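If the iterations should stay separate, tracked stages, a rough `foreach` sketch could look like this (the indices and the `models/model_<i>.pt` paths are assumptions; the templating has no arithmetic, so each entry spells out its previous index explicitly):

```yaml
# dvc.yaml (sketch) -- one generated stage per list entry, e.g. retrain_model@1
stages:
  retrain_model:
    foreach:
      - { idx: 1, prev: 0 }
      - { idx: 2, prev: 1 }
    do:
      cmd: sbatch --wait jobs/retrain_model.sh ${item.idx}  # running on GPU
      deps:
        - models/model_${item.prev}.pt   # output of the previous iteration (assumed path)
      outs:
        - models/model_${item.idx}.pt
```

A small script could still generate that `foreach` list if the number of iterations grows, while DVC keeps tracking each iteration's outputs separately.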