DVC on HPC with CML and large(r) number of experiments

Hi, we are currently testing and configuring a new HPC setup, and I will be working on a best-practice guide and template projects for running ML/data science projects using DVC, CML, and possibly other tools.

Based on the examples I have read, I am wondering whether the CML workflow also supports large parameter grid searches?

Thank you!

The DVC/CML documentation does mention something like this, but not how to do it. Link: CI/CD for Machine Learning | Data Version Control · DVC (bottom of the page).

Not sure I follow, but this doesn’t seem particularly CML-related. The bottom of that article (regarding grid search) links to DVC Get Started: Experiments (which you could of course run in a CML workflow)?

As I understand it (which might be incomplete): one makes some changes to the code or params, then pushes these changes to a git remote; this triggers the training of that specific commit as a single experiment (at least, I would like to have that triggered).

I don’t understand how to perform a grid search (or just several experiments) using CML without following the workflow described above for each entry.

I’m still looking for an understanding of the intended workflow in this case. If anyone has something to add please do :slight_smile:

Hi @RiCk, there’s not a documented workflow for this scenario today, but I’d be curious to discuss it with you.

There’s some support for doing a grid search in DVC: exp run. If you use this starting point, I could imagine a couple ways to run the search on CML:

  1. Put this into your CML workflow so that the workflow has something like dvc exp run -S 'train.min_split=2,8,64' -S 'train.n_est=100,200' --queue; dvc queue start. This might be limiting since it would require editing the workflow file with the search parameters.
  2. Queue experiments locally and push each one to trigger its own CML job. You could run dvc exp run -S 'train.min_split=2,8,64' -S 'train.n_est=100,200' --queue locally and then push each one to its own branch that would each trigger a CML job. This would currently require a bash script to parse all the queued experiment names and push them to branches, but if it works well for you, I’m sure we could make it easier to do within DVC.
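For option 1, the queue-and-run step inside the CML job could look roughly like this (a minimal sketch; the parameter names and values are the illustrative ones from above, and collecting results with dvc exp show is my assumption about how you’d feed the report step):

```shell
# Sketch of option 1 inside a CML job (parameter names/values are illustrative).
# Queue one experiment per combination of the grid:
dvc exp run --queue -S 'train.min_split=2,8,64' -S 'train.n_est=100,200'

# Run everything in the queue, two experiments at a time, and wait for them:
dvc exp run --run-all --jobs 2

# Dump the results table so a later CML report step can pick it up:
dvc exp show --csv > experiments.csv
```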

Hi @dberenbaum, thank you very much for the reply! I really see the potential of using DVC and family, and I appreciate very much that these kinds of discussions are possible :slight_smile:

I think exp run is a good starting point. Reading the two options, I feel that option 2:

Queue experiments locally and push each one to trigger its own CML job.

would be the nicer solution, for three reasons: First, one does not need to change the DVC pipeline, which feels cleaner and probably integrates better with existing “local” grid search methods. Second, one would still be able to work as “normal”, i.e., pushing a single experiment by committing changes to hyperparameters and making a PR, which then triggers the CML runner. Third, by creating separate jobs, one can have multiple runners pick up those jobs and run them in parallel?!

Currently, the only “criticism” I have heard from colleagues is that they don’t like the potential creation of large numbers of branches in git. I think this might just be something we need to live with because the whole workflow is based on git; I just wanted to mention it in case it resonates with you somehow.

Not unimportant to ask as well: do you have the mentioned bash script lying around, and could you share it with me?

That all makes sense. I think you could actually fit the custom bash logic into one line like:

dvc queue status | grep Queued | xargs -n 6 bash -c 'git push origin $0:refs/heads/dvc-exp/$1'

This would filter a list of all queued experiments and then run git push origin <rev>:refs/heads/dvc-exp/<name>, which should push each queued revision to a branch named dvc-exp/<name>.
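If the one-liner gets hard to read, the same logic can be unrolled into a small function (a sketch; it assumes, as above, that queued lines from dvc queue status start with the experiment revision followed by the experiment name):

```shell
# queued_refspecs: read `dvc queue status`-style lines on stdin and print one
# <rev>:refs/heads/dvc-exp/<name> refspec per queued experiment.
queued_refspecs() {
  grep Queued | while read -r rev name _rest; do
    printf '%s:refs/heads/dvc-exp/%s\n' "$rev" "$name"
  done
}

# Usage: push each queued experiment to its own branch
#   dvc queue status | queued_refspecs | xargs -n 1 git push origin
```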

Regarding the noise of the branches, I agree. That’s why I used dvc-exp/<name> in the code above, so at least you can easily recognize those branches. It would be great if we could think of better ways to reduce this noise, like we do locally with custom refs.
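One small mitigation: since the branches share the dvc-exp/ prefix, cleaning them up afterwards is mostly a one-liner too (a sketch, assuming GNU xargs; it deletes the branches, so run it only once the runs have finished and you no longer need them):

```shell
# Delete all remote branches under the dvc-exp/ prefix (destructive!).
git fetch --prune origin
git branch -r --list 'origin/dvc-exp/*' \
  | sed 's|origin/||' \
  | xargs -r -n 1 git push origin --delete
```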

Let me know how this approach works for you!

Thank you! I will try it out and let you know.

Regarding multiple runners and the cache, especially writing to it: would it cause problems if I had multiple runners with access to the same cache?

Also, is there a way in CML to attach a persistent (Kubernetes) volume to the runner to serve as the cache? Or how do people solve this?

Rick

It should not be a problem to have multiple runners access the same cache. Unfortunately, I don’t know the best way to attach a persistent volume to the runner. You may want to take a look at the discussion in Support volume mounts (e.g. nfs) for Kubernetes · Issue #658 · iterative/terraform-provider-iterative · GitHub.
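For reference, pointing several runners at one shared cache usually comes down to a few config commands (a sketch; /mnt/dvc-cache is an illustrative mount path for whatever shared volume you end up with):

```shell
# Use a shared mount as the DVC cache and make it safe for multiple users.
dvc cache dir /mnt/dvc-cache        # illustrative path to the shared volume
dvc config cache.shared group       # make cache files group-writable
dvc config cache.type symlink,copy  # link workspace files out of the cache
```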