Hi,
Right now I have a pipeline in which the first stages create some initial data and train an initial model. After that there is a stage which uses this model to create new samples for which it has high uncertainty. Then a stage runs some reference calculations to get the labels for the new samples. After that, in another stage, the new samples are used to retrain the model. I run these calculations on a SLURM cluster, and the commands of the stages submit jobs with sbatch. While the generation of new samples and the retraining run on GPUs, the calculation of the new labels happens on CPUs. The structure of the pipeline currently looks something like this:
```yaml
stages:
  # < some initial stages >
  uncertainty_optimization:
    cmd: sbatch --wait jobs/uncertainty_optimization.sh  # running on GPU
    ...
  calculate_labels:
    cmd: sbatch --wait jobs/calculate_labels.sh  # running on CPU
    ...
  retrain_model:
    cmd: sbatch --wait jobs/retrain_model.sh  # running on GPU
    ...
```
Now I want to run this iteratively. Are there any suggestions for a good way to achieve this? Thank you very much!
Best regards,
Matthias
Could you please clarify what exactly you are trying to achieve?
I want to run the stages multiple times. It will always be the same code, only the index of the respective iteration is handed over, and the stages will depend on the outputs of the prior iteration. So for the first two iterations it would be something like:
```yaml
stages:
  uncertainty_optimization_0:
    cmd: sbatch --wait jobs/uncertainty_optimization.sh 0  # running on GPU
    ...
  calculate_labels_0:
    cmd: sbatch --wait jobs/calculate_labels.sh 0  # running on CPU
    ...
  retrain_model_0:
    cmd: sbatch --wait jobs/retrain_model.sh 0  # running on GPU
  uncertainty_optimization_1:
    cmd: sbatch --wait jobs/uncertainty_optimization.sh 1  # running on GPU
    ...
  calculate_labels_1:
    cmd: sbatch --wait jobs/calculate_labels.sh 1  # running on CPU
    ...
  retrain_model_1:
    cmd: sbatch --wait jobs/retrain_model.sh 1  # running on GPU
```
So the stages with index 0 will produce outputs, e.g. the retrained model, which will be used in the next iteration. Maybe it is questionable whether it makes sense to have these as separate stages, but each stage has a lot of content and produces outputs which should be tracked. Also, I would like to be able to first run a few iterations and then continue with a few more later. I guess I could just use a script to generate a dvc.yaml with the different iterations as stages. I was just wondering if there is a smarter way to do this which I am missing, or if there are any recommendations/thoughts.
Okay, I see. I think what can help is to use persistent outputs (the `persist: true` flag in the `dvc.yaml` spec) and to pass the iteration as a parameter. Then you would be able to run the pipeline with something like `dvc exp run -S 'iteration=1'`. Or just change the `params.yaml` file yourself and run it again.
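A minimal sketch of what I mean, assuming `iteration` is defined in `params.yaml` and the retrained model is written to a placeholder path `model.pt` (adjust the names to your setup):

```yaml
# params.yaml
iteration: 0
```

```yaml
# dvc.yaml (sketch)
stages:
  retrain_model:
    cmd: sbatch --wait jobs/retrain_model.sh ${iteration}  # running on GPU
    params:
      - iteration          # stage re-runs whenever the iteration value changes
    outs:
      - model.pt:          # placeholder output path
          persist: true    # not deleted before the run, so the script can start from the previous model
```

Since `persist: true` keeps `model.pt` in place between runs, each `dvc exp run -S 'iteration=N'` can pick up the model produced by the previous iteration.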
Also, if you have a predefined number of iterations and it's not too large, I think you could try to use `matrix` or a `foreach` loop in `dvc.yaml`.
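If the iterations should stay separate, tracked stages, a rough `foreach` sketch could look like this (the indices and the `models/model_<i>.pt` paths are assumptions; the templating has no arithmetic, so each entry spells out its previous index explicitly):

```yaml
# dvc.yaml (sketch) -- one generated stage per list entry, e.g. retrain_model@1
stages:
  retrain_model:
    foreach:
      - { idx: 1, prev: 0 }
      - { idx: 2, prev: 1 }
    do:
      cmd: sbatch --wait jobs/retrain_model.sh ${item.idx}  # running on GPU
      deps:
        - models/model_${item.prev}.pt   # output of the previous iteration (assumed path)
      outs:
        - models/model_${item.idx}.pt
```

A small script could still generate that `foreach` list if the number of iterations grows, while DVC keeps tracking each iteration's outputs separately.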