In machine learning, it is important to repeat the same training multiple times (often with different seeds) and then compute the average and standard deviation of each metric, in order to evaluate the stability of the model and check the statistical significance of the results.
Let’s say we have a stage `trainmodel` which has a parameter `seed`, and we want to run the training 10 times with 10 different seeds. Once all trainings have been completed, we have a script that takes as input the results (metric files) from the 10 runs and produces the average and standard deviation for each metric.
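To make the aggregation step concrete, here is a minimal sketch of what such a script could look like. It assumes each run writes a flat JSON metrics file (for example `{"accuracy": 0.91}`); the function name `aggregate_metrics` and the file layout are hypothetical and only serve as an illustration.

```python
import json
import statistics


def aggregate_metrics(metric_files):
    """Return {metric: {"mean": ..., "std": ...}} over flat JSON metric files.

    Assumption: every file is a flat JSON object mapping metric names to numbers.
    """
    values = {}
    for path in metric_files:
        with open(path) as f:
            run = json.load(f)
        # Collect the value of each metric across all runs.
        for name, value in run.items():
            values.setdefault(name, []).append(float(value))
    return {
        name: {
            "mean": statistics.mean(vals),
            # Sample standard deviation; undefined for a single run, so use 0.0.
            "std": statistics.stdev(vals) if len(vals) > 1 else 0.0,
        }
        for name, vals in values.items()
    }
```

It would be called with the list of the 10 metric files, e.g. `aggregate_metrics(["metrics/seed_0.json", "metrics/seed_1.json"])`, where those paths are again just placeholders.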
The thing is that I don’t want to store all 10 trained models (since they are very large), but only keep the best one.
However, I want to keep the metrics of all 10 runs so that I can add a stage that computes the average and standard deviation of every metric.
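As an illustration of what "keep only the best model" means here, the selection could be sketched as follows. This assumes the metrics of each run are already loaded into a dictionary keyed by seed; the function `best_seed` and the metric name `accuracy` are hypothetical.

```python
def best_seed(metrics_by_seed, metric="accuracy", higher_is_better=True):
    """Return the seed whose run has the best value for the given metric.

    Assumption: metrics_by_seed maps seed -> {metric name: value}.
    """
    choose = max if higher_is_better else min
    return choose(metrics_by_seed, key=lambda seed: metrics_by_seed[seed][metric])
```

Only the model produced with the returned seed would then need to be kept, while the metric files of all runs stay available for the aggregation.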
I also want to launch all 10 trainings in parallel.
What would be the best way to do this with DVC?