Statistical significance stage best practice

Hello,

In machine learning, it is important to repeat the same training multiple times (often with a different seed) and then compute the average and standard deviation of a metric, to evaluate the stability of the model and check the statistical significance of the results.

Let’s say we have a stage trainmodel which has a parameter seed, and we want to run the training 10 times with 10 different seeds. Once all trainings have completed, we have a script that takes as input the results (metric files) from the 10 runs and produces the average and standard deviation for each metric.

The thing is that I don’t want to store the 10 trained models (since they are very heavy) but only keep the best one.
However, I want to keep the metrics of all 10 runs so that I can add a stage that computes the average and std of all metrics.
I also want to launch all 10 trainings in parallel.

What would be the best way to do this with dvc?

Hello @kwon-young!
As to parallelization: we have an open issue for that, and running multiple stages in parallel is currently not possible:
https://github.com/iterative/dvc/issues/755

It is possible to parallelize experiment runs when using dvc exp run with the --jobs option (https://dvc.org/doc/command-reference/exp/run); however, it seems that this is not applicable in your case, since you later want to join the results of particular runs.


Approach #1:

It seems to me that you could be interested in pipelines files parametrization (https://dvc.org/doc/user-guide/project-structure/pipelines-files#templating), which essentially allows you to parametrize stages. In your case you could use the seed to parametrize your pipeline and, later down the road, join the results of the stages.
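For illustration, here is a minimal dvc.yaml sketch (the stage, script, and file names are my assumptions, not from your project) that combines templating with the foreach feature to generate one training stage per seed, so that every run writes its metrics to its own file:

# dvc.yaml (hypothetical sketch)
stages:
  trainmodel:
    foreach: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]   # one generated stage per seed
    do:
      cmd: python train.py --seed ${item} --metrics metrics_seed_${item}.json
      deps:
        - train.py
        - data/prepared
      metrics:
        - metrics_seed_${item}.json

Each generated stage (trainmodel@0 … trainmodel@9) can be reproduced independently, and a downstream aggregation stage can declare all of the metrics_seed_*.json files as its deps.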

As to storing metrics and not storing intermediate results:
when creating your pipeline, use -O/--outs-no-cache to prevent dvc from caching the results (in that case remember not to add them to git, which sometimes happens when one uses git add -A), and add an additional step that chooses the best model and outputs it as a dvc-tracked file.

For metrics, use -m (dvc will track the file) or -M (in this case you need to track it with git).
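As a rough sketch of how those flags map onto dvc.yaml (the file names here are made up): -O corresponds to an output with cache: false, -m to a plain entry under metrics:, and -M to a metrics entry with cache: false that you version with git yourself:

# hypothetical stage illustrating the caching options
stages:
  train_seed_3:
    cmd: python train.py --seed 3
    deps:
      - train.py
    outs:
      - model_seed_3.pt:
          cache: false      # -O / --outs-no-cache: dvc will not cache this file
    metrics:
      - metrics_seed_3.json:
          cache: false      # -M: treated as a metric, but you commit it to git
    # dropping "cache: false" on the metric gives the -m behaviour (dvc caches it)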


Alternative approach:

  1. Make the list of seeds a parameter in params.yaml.
  2. Make the training script accept the list of seeds and create a directory of models (not tracked by dvc; see the previous approach) and a file with all the metrics.
  3. Add a step that chooses the best model.

This approach has the advantage that it can run multiple trainings in parallel, if you implement that in your script.
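A sketch of what that could look like (the seeds key and the script names are assumptions): params.yaml holds the list of seeds, one stage trains everything without caching the models, and a second stage keeps only the best one:

# params.yaml
seeds: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# dvc.yaml
stages:
  train_all:
    cmd: python train_all.py          # iterates over the seeds from params.yaml
    deps:
      - train_all.py
    params:
      - seeds
    outs:
      - models/:
          cache: false                # directory of models, not cached by dvc
    metrics:
      - all_metrics.json:
          cache: false
  pick_best:
    cmd: python pick_best.py models/ all_metrics.json best_model.pt
    deps:
      - pick_best.py
      - models/
      - all_metrics.json
    outs:
      - best_model.pt                 # only the best model ends up in the dvc cache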

Thank you very much for your thoughts!

I suppose I could use a variable to parameterize my metric output name so that the 10 runs won’t overwrite each other.

Thanks, I had not thought of this. However, this means that I need another, almost identical stage which will save the trained model, maybe for the best seed or parameter combination…
I still need to decide what the best tradeoff between disk space and computation time is…

Unfortunately, I parallelize my trainings using a Slurm-like cluster, so it is really hard to do what you described.

No need to worry about overwriting if you make your code and dvc aware of the param. For example:
train.py --seed 10 could produce model_seed_10. In the dvc pipeline file you would need to define the output as model_seed_${seed}. I recommend reading through the pipelines files docs page; it is a really powerful feature.
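A minimal sketch of that idea, assuming the seed lives in params.yaml (templating picks up values from params.yaml by default) and the file names are placeholders:

# params.yaml
seed: 10

# dvc.yaml
stages:
  train:
    cmd: python train.py --seed ${seed}
    deps:
      - train.py
    params:
      - seed
    outs:
      - model_seed_${seed}            # resolves to model_seed_10 when seed is 10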

Ah, then it’s not as trivial as a simple Python run. How does invoking the job for a particular seed look? Maybe we could abstract that somehow? Is it a bash script or some CLI command?

The basic idea is that you have to use the job scheduler with a bash script in which I run my script. Currently, my bash script consists of:

dvc exp run $@    # run the experiment, forwarding the arguments passed at submission
dvc exp push      # push the finished experiment so it can be retrieved later

You can submit a job from a frontend server using myscheduler-submit -S mybashscript expename -S someparams=3, and the job will run when there are available nodes.

Once a job is finished, I get back the experiment results using dvc exp pull origin myexpe.

I have read that page and saw the foreach feature. This is cool because it lets me write down clearly the parameter sweeps I have done. But can each of the different stages it produces be run in parallel?

Could you design your process like:

  1. Make list of seeds a parameter in params.yaml.
  2. Make a training script that takes a single seed and trains a single model.
  3. Make a metascript that (a) takes all seeds; (b) iterates over each seed and submits a job for each that calls the training script; (c) collects the results, determines the best model, and saves only the outputs you need.
  4. Make that metascript your dvc stage.

You wouldn’t get the granularity of having each model training be in its own stage, but it would parallelize your jobs and allow you to only track what you need in the end.
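A sketch of what that single stage could look like in dvc.yaml (the script and file names are hypothetical); the metascript itself would submit one cluster job per seed, wait for them, and keep only what you want tracked:

# dvc.yaml -- hypothetical wrapper stage around the scheduler
stages:
  train_sweep:
    cmd: python submit_and_collect.py   # submits one job per seed, waits, picks the best
    deps:
      - submit_and_collect.py
      - train.py
      - data/prepared
    params:
      - seeds
    outs:
      - best_model.pt                   # only the best model is cached by dvc
    metrics:
      - all_metrics.json:
          cache: false                  # per-seed metrics, versioned with git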

I have often wanted to do just this, but there are multiple problems with this approach.
Points (a) and (b) are easy, but (c) is really hard to do since the execution of jobs is asynchronous:

  • knowing when all jobs are finished: I would need to parse the output of a CLI command and check it periodically, while keeping track of which jobs I have launched
  • collecting the results: currently I use dvc exp push/pull, but that would not be possible with a metascript
  • logging in the training script: currently I use dvclive, but that would also not be available with a metascript

Also, if I make my pipeline depend on the cluster, then I need to run my pipeline on the cluster frontend server, which is just a weak server for submitting jobs and where installing dvc is out of the question.

How about this strategy:

Let’s say we have a pipeline with 3 stages:

preprocessing → training → evaluation

  • preprocessing: transforms data deterministically, so it can be cached and retrieved for each experiment
  • training: produces metrics and a trained model, which is very heavy to store on disk
  • evaluation: uses a trained model to produce more metrics, maybe visualizations…

Let’s say we want to do a large grid-search on parameters and seeds for the training stage.
We want to keep all the metrics produced by the training stage but not the models.
However, for a small number of manually picked and interesting parameter combinations, we want to use the trained model to do the evaluation stage.

So, for the training stage, I actually use two versions of it:

  • training: the original stage, which saves the results and the model
  • training-gs: the stage to use for the grid search, which won’t save the model and saves its metrics in files whose names differ for each parameter combination. All grid-search parameter combinations are declared using the foreach feature of the stage

So the pipeline looks now like this:

preprocessing → training → evaluation
preprocessing → training-gs

First, I run all my grid-search training stages. I get the results, analyse them, narrow them down, and pick a few good parameter combinations.
Then I relaunch these combinations using the training stage in order to run the evaluation stage for each of them.
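To make that concrete, here is a trimmed-down dvc.yaml sketch of the two variants (the grid, the stage names, and the train.py flags are simplified placeholders, not my real setup): training-gs iterates over parameter combinations with foreach and only keeps metrics, while training also keeps the model for the combinations picked by hand:

# dvc.yaml -- hypothetical sketch of the two training variants
stages:
  training:
    cmd: python train.py --lr ${lr} --seed ${seed} --out model.pt
    deps: [train.py, data/preprocessed]
    params: [lr, seed]
    outs: [model.pt]                    # model kept for the evaluation stage
    metrics:
      - metrics.json:
          cache: false

  training-gs:
    foreach:                            # one generated stage per combination
      lr0.01_seed0: {lr: 0.01, seed: 0}
      lr0.01_seed1: {lr: 0.01, seed: 1}
      lr0.1_seed0:  {lr: 0.1, seed: 0}
    do:
      cmd: python train.py --lr ${item.lr} --seed ${item.seed} --no-save-model --metrics metrics_${key}.json
      deps: [train.py, data/preprocessed]
      metrics:
        - metrics_${key}.json:
            cache: false                # no model output, only the metrics are kept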

That sounds like a reasonable strategy if it works for you. A couple questions:

  1. Are you doing 10 experiments with different seeds for each parameter combo, aggregating metrics for all 10 experiments in each parameter combo, and then choosing the best parameter combo?
  2. Are you using the full training dataset for each experiment in the grid-search or are you sampling the data?

There are some GitHub issues related to your use case.

Please add any feedback you have. Experiments are still a pretty new feature, so how to best support grid searches is still an open question.

By the way, I don’t know the details of your environment or scheduler, but at least in Slurm I think you can usually use srun if you need jobs to be synchronous, and maybe even nest those inside of sbatch jobs to get the best of both worlds. You might be able to submit the full pipeline through sbatch, which does a bunch of dvc exp run --queue commands followed by dvc exp run --run-all -j [number of jobs], and that executes training stages that each include an srun command.
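As a rough sketch of that last idea (assuming your stage commands are allowed to call the scheduler directly; the srun arguments are made up), the training stage’s cmd would simply wrap the real training command in srun, so that dvc exp run --run-all -j N dispatches N jobs to the cluster at once:

# dvc.yaml -- hypothetical: let the cluster run the heavy command inside the stage
stages:
  training:
    cmd: srun --gres=gpu:1 python train.py --seed ${seed}
    deps: [train.py, data/preprocessed]
    params: [seed]
    metrics:
      - metrics.json:
          cache: false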

Yes, that’s the plan

For now, I’m using the same training/validation split for every experiment.
Adding k-fold cross-validation is way too complex for now, since it adds another dimension to the grid search.

Unfortunately, my scheduler does not have an equivalent of the srun command, but I understand your strategy: the idea is to translate the pipeline execution done by dvc into the relevant scheduler commands. Thank you for the insight.