The main goal is to automate as much as possible.
Have you started using DVC at all? If so how?
I have used DVC before, but not in this project. So it’s a more general question about a good Slurm cluster workflow.
What does your pipeline look like? What are you processing, what stages are there?
This is what I am using:
- mlflow for tracking results
- optuna for hyperparameter optimization
- gitlab for sharing code
My experiment workflow looks like this:
- change a bigger part of my code locally (for example different network architecture)
- test locally
- push to gitlab
- log in to the cluster and pull the changes
- submit several jobs: since I use optuna, I don’t have to change parameters or config files. It sets the parameters more or less randomly on the fly, so I can submit several jobs at once.
- wait a long time until the first job starts running, to make sure there is no error (I can’t test everything locally)
- wait a long time until the last job has finished, then commit the results (mlruns folder, optuna database)
- push results
- pull results locally
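As a rough sketch, the cluster-side part of the steps above (pull, then submit several identical jobs) could be scripted along these lines. The job script name `train.sbatch`, the study name, and the job count are placeholders I made up for illustration, not part of my actual setup. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def build_cluster_commands(n_jobs, job_script="train.sbatch", study_name="my-study"):
    """Return the cluster-side commands as argument lists.

    job_script and study_name are hypothetical placeholders.
    """
    # Update the working copy first, refusing anything but a fast-forward.
    commands = [["git", "pull", "--ff-only"]]
    # Submit the same job script several times; optuna picks the
    # parameters inside each job, so the submissions are identical.
    for i in range(n_jobs):
        commands.append(["sbatch", f"--job-name={study_name}-{i}", job_script])
    return commands


def run_commands(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops at the first failing command.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_commands(build_cluster_commands(n_jobs=4), dry_run=True)
```

Running it on the login node with `dry_run=False` would execute the commands for real. Committing and pushing the results would still have to wait until the last job has finished.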
What do you mean by “result management”? Tracking experiment results perhaps? If so, do you do this with certain metrics, which ones?
By “result management” I mean “everything you do with your results after you have collected them”. For example, creating a static HTML page with condensed information that everyone (including “non-techies”) can use and interpret. But that is not specific to clusters, I think. I probably just wanted to say that I find it hard to test everything. I track metrics, parameters and artifacts (some images), but no bigger files at the moment.
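To illustrate what I mean by a static HTML page, here is a minimal sketch using only the Python standard library. The run names, the `accuracy` field and its values are invented placeholders, not real results. In practice the rows would come from whatever the tracking tool exports.

```python
from html import escape


def render_report(runs, title="Experiment summary"):
    """Render a list of run dicts as a small static HTML page, best run first.

    Each run is assumed to have a "name" and an "accuracy" key (hypothetical).
    """
    ordered = sorted(runs, key=lambda r: r["accuracy"], reverse=True)
    rows = "\n".join(
        f"<tr><td>{escape(r['name'])}</td><td>{r['accuracy']:.3f}</td></tr>"
        for r in ordered
    )
    safe_title = escape(title)
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>{safe_title}</title></head>\n"
        f"<body><h1>{safe_title}</h1>\n"
        "<table><tr><th>Run</th><th>Accuracy</th></tr>\n"
        f"{rows}\n"
        "</table></body></html>\n"
    )


if __name__ == "__main__":
    # Placeholder data, purely for illustration.
    example_runs = [
        {"name": "run-a", "accuracy": 0.5},
        {"name": "run-b", "accuracy": 0.75},
    ]
    with open("report.html", "w", encoding="utf-8") as fh:
        fh.write(render_report(example_runs))
```

The resulting `report.html` has no dependencies, so it can be shared as a single file.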
For example if you use SSH data storage, you control that via system users. If you use Azure for remote storage, then similarly each user can be required to provide their own connection string, letting you control granular access.
The bigger the data, the less often you want to move it. So on this cluster you have a home folder where you can put the data, and the compute nodes have access to it. It’s not important for me right now, I just wanted to know whether there is a “nice” solution at the moment.