Workflow on slurm-like clusters

xyz · September 6, 2020, 2:09pm

Hi,
I have access to a Slurm (GPU) cluster and my current workflow is quite a mess. Does anyone have advice or experience with this? Here are some problems I see:

- you have to submit jobs:
  - jobs are scheduled, delayed (sometimes hours)
  - jobs can run in parallel
  - you have to “wait” till the last job started, before you can queue new jobs
  - manual “committing” of results
- docker unfriendly
- testing the whole pipeline from data processing to “result management” is hard
- data security: everybody has access to everything. There are tools like rclone and crypten. Can dvc help with this problem or interact with those tools?

Can dvc / cml help there to automate things?
Thanks in advance.

xyz · September 19, 2020, 5:32am

Really no one has any thought on one of the points? Or is something unclear?

jorgeorpinel · September 19, 2020, 7:30pm

Oops sorry about that @xyz, this must have slipped through the cracks.

It would be helpful if you could elaborate on your workflow and on what you’re trying to achieve that DVC might help with. Some questions:

- What does your pipeline look like? What are you processing, what stages are there?
- Have you started using DVC at all? If so, how?
- What do you mean by “result management”? Tracking experiment results perhaps? If so, do you do this with certain metrics, and which ones?

DVC leaves data access management up to you. It supports multiple remote storage platforms, and most of them let you set up user access controls. For example, if you use SSH data storage, you control access via system users. If you use Azure for remote storage, each user can similarly be required to provide their own connection string, which gives you granular access control.
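As a rough illustration of per-user access on an SSH remote, the sketch below builds the two DVC commands involved: `dvc remote add` registers the remote in the shared config, and `dvc remote modify --local` stores each person's user name in their own `.dvc/config.local`, which is not committed to Git. The remote name, URL, and user name are made-up placeholders, and the helper function is hypothetical, not part of DVC.

```python
from typing import List


def remote_setup_commands(name: str, url: str, user: str) -> List[List[str]]:
    """Build the DVC commands for a shared remote with a per-user login.

    Hypothetical helper: it only assembles argument lists, it does not run them.
    """
    return [
        # Shared setting, written to .dvc/config and committed with the repo.
        ["dvc", "remote", "add", "-d", name, url],
        # Per-user setting, written to .dvc/config.local (kept out of Git).
        ["dvc", "remote", "modify", "--local", name, "user", user],
    ]


# Placeholder values for illustration only.
commands = remote_setup_commands("storage", "ssh://example.org/srv/dvc", "alice")
for cmd in commands:
    print(" ".join(cmd))
```

Each command list could be passed to `subprocess.run(cmd, check=True)` inside a DVC repository. Who may actually read or write the data is still decided by the SSH server's system users, not by DVC.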

Thanks

xyz · September 22, 2020, 9:03am

Hi,
the main goal is automating as much as possible.

Have you started using DVC at all? If so how?

I have used dvc before but not in this project. So it’s a more general question about a good slurm cluster workflow.

What does your pipeline look like? What are you processing, what stages are there?

This is what I am using:

- mlflow for tracking results
- optuna for hyperparameter optimization
- gitlab for sharing code

My experiment workflow looks like this:

- change a bigger part of my code locally (for example a different network architecture)
- test locally
- push to gitlab
- log in to the cluster and pull the changes
- submit several jobs: since I use optuna I don’t have to change parameters or config files. It sets parameters more or less randomly on the fly, so I can submit several jobs at once.
- wait a long time until the first job runs to make sure there is no error (I can’t test everything locally)
- wait a long time until the last job has finished, then commit the results (mlruns folder, optuna database)
- push results
- pull results locally
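One way to avoid waiting by hand for the last job is Slurm's job dependencies: a final job can be queued immediately and will start only after the training jobs have ended. The sketch below is a minimal illustration under assumptions, not a tested setup for this cluster. The script names `train.sh` and `collect.sh` and the helper functions are hypothetical. It relies on `sbatch --parsable` (prints only the job ID) and `sbatch --dependency=afterany:...` (start after the listed jobs have terminated, whether or not they succeeded).

```python
import subprocess
from typing import List, Optional, Sequence


def sbatch_command(script: str, depends_on: Optional[Sequence[str]] = None) -> List[str]:
    """Build an sbatch command line for a batch script."""
    cmd = ["sbatch", "--parsable"]  # --parsable: print just the job ID
    if depends_on:
        # afterany: run once every listed job has terminated, even if some failed.
        cmd.append("--dependency=afterany:" + ":".join(depends_on))
    cmd.append(script)
    return cmd


def submit(cmd: List[str]) -> str:
    """Run sbatch and return the job ID it prints."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    # With --parsable the output is "jobid" or "jobid;cluster".
    return result.stdout.strip().split(";")[0]


def submit_sweep(n_jobs: int, train_script: str, collect_script: str) -> str:
    """Queue n_jobs training jobs plus one collection job that waits for all of them."""
    job_ids = [submit(sbatch_command(train_script)) for _ in range(n_jobs)]
    return submit(sbatch_command(collect_script, depends_on=job_ids))
```

On a login node, `submit_sweep(8, "train.sh", "collect.sh")` would queue eight training jobs and one follow-up job. The follow-up script could then commit and push the mlruns folder and the optuna database, assuming the compute nodes are allowed to reach GitLab.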

What do you mean by “result management”? Tracking experiment results perhaps? If so, do you do this with certain metrics, which ones?

By “result management” I mean everything you do with your results after you have collected them, for example creating a static HTML page with condensed information that everyone (including “non-techies”) can use and interpret. But I don’t think that is specific to those clusters. I probably just wanted to say that I find it hard to test everything. I track metrics, parameters, and artifacts (some images), but no bigger files at the moment.

For example if you use SSH data storage, you control that via system users. If you use Azure for remote storage, then similarly each user can be required to provide their own connection string, letting you control granular access.

The bigger the data the less often you want to move it. So on this cluster you have a home folder where you can put the data and the running nodes have access to that. It’s not an important thing for me right now, just wanted to know if there is a “nice” solution at the moment.

Thanks

jorgeorpinel · September 22, 2020, 4:25pm

Gotcha. I’ll check if other team members have specific tips for those tools.

I know that DVC and MLflow can happily coexist. But if you mainly use the latter for tracking/visualizing results, and since you’re using GitLab, you may be interested in our sister project CML (for this or other projects).

This seems like a good area where DVC can help. I’m just not sure how to integrate it with the existing tools you’re using for hyperparams and metrics (DVC has its own solutions for that). You’d have to play around with it a bit. Feel free to send us follow-up questions here or at dvc.org/chat.

But the tracking of large files/dirs is one of the core features of DVC.

Convenient! With DVC, you could use a “shared server” approach (variation): construct the pipeline stages using external dependencies to the data in ~/, since you know it will be there at run time. Locally you would need dummy or sample data files with the same file names to test your code.
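To make the external-dependency idea concrete, here is a small sketch that builds a `dvc stage add` command whose dependency is a data folder outside the repository. The stage name, `train.py`, `model.pt`, and the data path are placeholders chosen for illustration, and the helper function is hypothetical. In the DVC 1.x releases current when this thread was written, the equivalent command was `dvc run` with the same `-n`, `-d`, and `-o` options.

```python
import shlex
from typing import List


def stage_add_command(data_dir: str) -> List[str]:
    """Build a `dvc stage add` call with an external (outside-the-repo) dependency.

    Hypothetical helper: it only assembles the argument list.
    """
    return [
        "dvc", "stage", "add",
        "-n", "train",      # stage name (placeholder)
        "-d", data_dir,     # external dependency, e.g. data in the cluster home folder
        "-d", "train.py",   # the training code itself (placeholder file name)
        "-o", "model.pt",   # output tracked by DVC (placeholder file name)
        # The stage command is passed as one string so its arguments stay together.
        "python train.py " + shlex.quote(data_dir),
    ]


# Placeholder path for illustration only.
print(stage_add_command("/home/alice/data"))
```

Locally, the same path would have to contain sample files with the same names, as described above, so that the stage can be run for testing before submitting to the cluster.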

Another option is to use external outputs, if you want to track changes in those data files/dirs.
