Workflow on Slurm-like clusters

Hi,
I have access to a Slurm (GPU) cluster and my current workflow is quite a mess. Does anyone have advice or experience with this? Some problems I see:

  • you have to submit jobs:

    • jobs are scheduled and can be delayed (sometimes by hours)
    • jobs can run in parallel
    • you have to “wait” until the last job has started before you can queue new jobs
    • manual “committing” of results
  • Docker-unfriendly

  • testing the whole pipeline from data processing to “result management” is hard

  • data security: everybody has access to everything.
    There are tools like rclone and crypten. Can DVC help with this problem or interact with those tools?

Can DVC / CML help to automate any of this?

Thanks in advance. :grinning:

Really, no one has any thoughts on any of these points? Or is something unclear?

Oops sorry about that @xyz, this must have slipped through the cracks.

It would be helpful if you could elaborate on your workflow and on what you’re trying to achieve that DVC might help with. A few questions:

  • What does your pipeline look like? What are you processing, what stages are there?
  • Have you started using DVC at all? If so, how?
  • What do you mean by “result management”? Tracking experiment results perhaps? If so, do you track certain metrics, and which ones?

DVC leaves data access management up to you. It supports multiple remote storage platforms, and most of them let you set up user access controls. For example, if you use SSH data storage, you control that via system users. If you use Azure for remote storage, then similarly each user can be required to provide their own connection string, giving you granular access control.
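
To make the SSH case concrete, here is a minimal Python sketch (the repo URL and file path are made up): each user fetches DVC-tracked data through `dvc.api`, the transfer is authenticated with that user’s own SSH credentials, and the permissions on the storage server decide what they can actually read.

```python
# Minimal sketch (hypothetical repo URL and file path): reading a DVC-tracked
# file whose remote is an SSH server. DVC authenticates with the calling
# user's own SSH credentials, so server-side Unix permissions control access.
import dvc.api

with dvc.api.open(
    "data/dataset.csv",                               # hypothetical DVC-tracked path
    repo="git@gitlab.example.com:group/project.git",  # hypothetical Git repo
    rev="main",
) as f:
    print(f.readline())
```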

Thanks

Hi,
the main goal is to automate as much as possible.

Have you started using DVC at all? If so, how?

I have used DVC before, but not in this project, so this is more of a general question about a good Slurm cluster workflow.

What does your pipeline look like? What are you processing, what stages are there?

This is what I am using:

  • MLflow for tracking results
  • Optuna for hyperparameter optimization
  • GitLab for sharing code

My experiment workflow looks like this:

  1. change a bigger part of my code locally (for example, a different network architecture)
  2. test locally
  3. push to GitLab
  4. log in to the cluster and pull the changes
  5. submit several jobs: since I use Optuna, I don’t have to change parameters or config files; it samples parameters more or less randomly on the fly, so I can submit several jobs at once (roughly as in the sketch after this list)
  6. wait a long time until the first job runs, to make sure there is no error (I can’t test everything locally)
  7. wait a long time until the last job has finished, then commit the results (mlruns folder, Optuna database)
  8. push the results
  9. pull the results locally
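
Stripped down, each submitted job runs something like this (function names, file names, and numbers are illustrative); all jobs share one Optuna storage database and write to the same mlruns folder, which is why I can queue several at once:

```python
# Illustrative sketch of one submitted job. All parallel jobs share the same
# Optuna storage database and MLflow tracking folder.
import mlflow
import optuna


def train_model(lr, hidden):
    # placeholder for the real training loop (returns a dummy validation loss)
    return (lr - 1e-3) ** 2 + 1.0 / hidden


def objective(trial):
    # Optuna samples hyperparameters on the fly, so no config files need to change
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    hidden = trial.suggest_int("hidden_units", 32, 512)

    with mlflow.start_run():
        mlflow.log_params({"lr": lr, "hidden_units": hidden})
        val_loss = train_model(lr, hidden)
        mlflow.log_metric("val_loss", val_loss)
    return val_loss


if __name__ == "__main__":
    mlflow.set_tracking_uri("file:./mlruns")   # the mlruns folder I commit later
    study = optuna.create_study(
        study_name="demo-study",               # illustrative name
        storage="sqlite:///optuna.db",         # the Optuna database I commit later
        load_if_exists=True,
        direction="minimize",
    )
    study.optimize(objective, n_trials=20)
```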

What do you mean by “result management”? Tracking experiment results perhaps? If so, do you track certain metrics, and which ones?

By “result management” I mean everything you do with your results after you have collected them, for example creating a static HTML page with condensed information that everyone (including “non-techies”) can use and interpret. But that is not specific to these clusters, I think; what I really wanted to say is that I find it hard to test everything. I track metrics, parameters, and artifacts (some images), but no bigger files at the moment.
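
For example, something as simple as this (file names are made up, and it assumes the mlruns folder is available locally) already produces a static HTML table that non-technical colleagues can open in a browser:

```python
# Illustrative sketch: condense MLflow runs into a static HTML table.
# Assumes the mlruns folder is available locally; file names are made up.
from pathlib import Path

import mlflow

mlflow.set_tracking_uri("file:./mlruns")
runs = mlflow.search_runs()  # active experiment by default; returns a pandas DataFrame

cols = [c for c in runs.columns if c.startswith(("params.", "metrics."))]
Path("report.html").write_text(runs[cols].to_html(index=False))
```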

For example, if you use SSH data storage, you control that via system users. If you use Azure for remote storage, then similarly each user can be required to provide their own connection string, giving you granular access control.

The bigger the data, the less often you want to move it. On this cluster you have a home folder where you can put the data, and the compute nodes have access to it. It’s not that important for me right now; I just wanted to know whether there is a “nice” solution at the moment.

Thanks


Gotcha. I’ll check if other team members have specific tips for those tools.

I know that DVC and MLflow can happily coexist. But if you mainly use the latter for tracking/visualizing results, and since you’re using GitLab, you may be interested in our sister project CML (for this or other projects).

This seems like a good area where DVC can help. I’m just not sure how to integrate it with the existing tools you’re using for hyperparameters and metrics (DVC has its own solutions for that); you’d have to play around with it a bit. Feel free to send us follow-up questions here or at dvc.org/chat.

But the tracking of large files/dirs is one of the core features of DVC.
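
For reference, the “own solutions” I mentioned boil down to plain files: a stage script writes its metrics to a small JSON (or YAML) file, that file is listed under `metrics:` in dvc.yaml, and `dvc metrics show` / `dvc metrics diff` compare it across commits. A hedged sketch of the script side (file name and values are just placeholders):

```python
# Hedged sketch: write metrics to a small JSON file that a dvc.yaml stage
# declares under `metrics:`. File name and values are placeholders.
import json

metrics = {"val_loss": 0.123, "accuracy": 0.91}  # placeholder values

with open("metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)
```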

Convenient! With DVC, you could use a variation of the “shared server” approach: define the pipeline stages with external dependencies pointing to the data in ~/, since you know it will be there at run time. Locally you would need dummy or sample data files with the same file names to test your code.
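
One way to handle the “same file names locally” part (just a sketch, all paths are invented) is to resolve the data directory at run time, so the same stage script reads the real data from the cluster home folder and a small local sample when you test on your laptop; the cluster path is what you would list as the external dependency in dvc.yaml:

```python
# Sketch (paths are invented): resolve the data directory at run time so the
# same stage script works on the cluster (real data in the home folder) and
# locally (small sample with identical file names).
import os
from pathlib import Path

DATA_DIR = Path(os.environ.get("DATA_DIR", str(Path.home() / "project-data")))


def load_file(name: str) -> bytes:
    path = DATA_DIR / name  # e.g. name = "train.csv" both locally and on the cluster
    if not path.exists():
        raise FileNotFoundError(f"expected {path}; add a dummy sample for local tests")
    return path.read_bytes()
```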

Another option is to use external outputs — if you want to track changes in those data files/dirs.
