Multiple dvc runs in parallel


I am trying to run multiple dvc experiments simultaneously on a GPU cluster. The way the cluster works is that you send a bash script to the cluster and it runs the job when it's your turn. In my bash script, I configure CUDA_VISIBLE_DEVICES, activate my conda environment, and run the experiment using dvc exp run --temp. Before I submit a job, I make the changes I want for that experiment, e.g. set the variables in params.yaml. I am trying to send a job, wait for it to start running, then make changes and send another. The second job fails with this error:

ERROR: Unable to acquire lock. Most likely another DVC process is running or was terminated abruptly. Check the page <> for other possible reasons and to learn how to resolve this.

I was hoping that using --temp would allow me to run these jobs at the same time because it isolates the project in a separate directory. One thing I should note is that my cache dir is in a separate directory outside my project directory, because I am not allowed to use that much space in my user directory.

My question is related to this one: the suggested course of action there was to use --queue and -j to run the experiments in parallel. How can I use the built-in parallel experiment execution but run the experiments in separate job submissions to the cluster? Or what is the intended workflow for this sort of situation?


We don’t have a recommended workflow for this situation per se, though we can suggest some ways you could approach it. One option is to provide the GPU via params: schedule only as many experiments as you have available GPUs, and give each one a different GPU id. That will require some legwork if you have more experiments than GPUs, because you will have to repeat the cycle of scheduling experiments → training → scheduling experiments → training, and so on, until all the experiments have run.
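As a rough sketch of that round-by-round legwork (the experiment names and the `rounds` helper are made up for illustration; in practice each pair would become one `dvc exp run` with the GPU id passed via params):

```python
# Hypothetical sketch: with more experiments than GPUs, run them in rounds,
# pairing each experiment in a round with a distinct GPU id.
GPU_IDS = [0, 1, 2]

def rounds(experiments, gpu_ids):
    """Yield one round of (experiment, gpu_id) pairs at a time."""
    for start in range(0, len(experiments), len(gpu_ids)):
        batch = experiments[start:start + len(gpu_ids)]
        yield list(zip(batch, gpu_ids))

schedule = list(rounds(["exp-a", "exp-b", "exp-c", "exp-d"], GPU_IDS))
for round_no, assignments in enumerate(schedule):
    # a full round uses every GPU; the last round may be partial
    print(f"round {round_no}: {assignments}")
```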

An alternative approach, if you have control over the training server, is to find a way to “automatically” reserve a GPU. One way to do that is to use a common directory, with files inside it indicating which GPUs are reserved and which are free.
For example, the following code:

import os
import random
import time
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager

import zc.lockfile

GPU_IDS = [0, 1, 2]

lock_dir = "locks"
os.makedirs(lock_dir, exist_ok=True)

def lock(gpu_id):
    path = os.path.join(lock_dir, str(gpu_id))
    return zc.lockfile.LockFile(path)

@contextmanager
def lock_free_gpu():
    # Keep trying until we manage to lock a free GPU.
    while True:
        for gpu_id in GPU_IDS:
            try:
                l = lock(gpu_id)
            except zc.lockfile.LockError:
                print(f"Can't lock '{gpu_id}', trying another one")
                continue
            try:
                yield gpu_id
                return
            finally:
                # Release the lock even if the task raised.
                l.close()
        time.sleep(1)

def training_task(task_number):
    with lock_free_gpu() as gpu_id:
        print(f"Working on task: '{task_number}', gpu: '{gpu_id}'")
        time.sleep(random.random())  # simulate different time of execution

with ThreadPoolExecutor(max_workers=3) as executor:
    for i in range(20):
        executor.submit(training_task, i)

It “reserves” a GPU by locking the appropriate file, whose path is locks/{GPU-id}, so that other runs know not to use that one. This approach only makes sense if all users of the server use the “locking” context manager, and you also need to make sure to run the experiments with --jobs <= number_of_gpus.
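The same reservation idea can also be sketched with only the standard library, using atomic file creation instead of zc.lockfile (a hypothetical variant, not the code above; `try_reserve` and `release` are made-up names):

```python
# Stdlib-only sketch of the reservation idea: atomically create one marker
# file per GPU; O_EXCL makes creation fail if the file already exists, i.e.
# if another process already holds that GPU.
import os
import tempfile

lock_dir = tempfile.mkdtemp()  # stand-in for the shared "locks" directory

def try_reserve(gpu_id):
    path = os.path.join(lock_dir, str(gpu_id))
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL)
        os.close(fd)
        return True
    except FileExistsError:
        return False

def release(gpu_id):
    os.remove(os.path.join(lock_dir, str(gpu_id)))

first = try_reserve(0)   # succeeds: nobody holds GPU 0 yet
second = try_reserve(0)  # fails: the marker file already exists
```

Note that, unlike zc.lockfile, a marker file is not released automatically if a job crashes, so this variant would need manual cleanup of stale files.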

Thanks for the response!

The problem is that I don’t know which GPU I will get until the cluster assigns me one. The bash script I send to the cluster determines which GPU I have been assigned and sets the environment variable as needed.

Also, would specifying the GPU get me around the lock issue? It seems like dvc has an issue with multiple processes accessing the repo or the data or something like that. I was hoping that isolating the experiments with --temp would get me around this, but it doesn’t seem to be working the way I assumed it would.

Hi @gregstarr! Unfortunately, it’s not yet possible to start new experiments while others are running. You would have to queue them and run them simultaneously to take advantage of parallel execution. We are working on addressing this limitation right now, and you can keep track of the progress in exp queue: command for managing queues · Issue #5615 · iterative/dvc · GitHub.

Thanks for the response. Bummer that it is not possible at the moment, but I’m glad to see it’s being worked on!

So, what you can do is queue a few experiments and then dvc exp run --run-all -j {num_jobs}. But to be able to do that, you need some magic to determine which GPU your process should take. You said you have a bash script that determines that? Then maybe your stage cmd should look like
./ && python
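For illustration only (the script and file names here are hypothetical, since the originals were not given), such a stage in dvc.yaml could chain a GPU-selection script with the training script:

```yaml
stages:
  train:
    # both names are made up for illustration
    cmd: ./pick_gpu.sh && python train.py
    deps:
      - train.py
```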

By the way, there is one hacky workaround available now if you need to run jobs simultaneously. You can make multiple copies of your repo and run one experiment per copy.

Ok I think I understand. Thanks for the clarification.

The current workflow is to simply submit my bash script to the cluster job manager. The bash script looks something like this:

  1. misc. configuration
  2. name={read the file that says which gpu I am assigned}
  3. export CUDA_VISIBLE_DEVICES="${name}"
  4. activate conda env
  5. dvc exp run --temp

Then I submit the job with a qsub command. This submits the job to the job manager and returns.
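The steps above could look roughly like this as a script (a sketch only: the assignment-file path and the conda environment name are assumptions, since the real ones were not given):

```shell
#!/bin/bash
# misc. configuration goes here

# read which GPU the scheduler assigned to this job
# (the file path is hypothetical)
name=$(cat /path/to/assigned_gpu_file)
export CUDA_VISIBLE_DEVICES="${name}"

# activate the conda environment (name is hypothetical)
source activate myenv

dvc exp run --temp
```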

It sounds like what I need to do instead is have my command submit the job, i.e. I make my changes, queue up my experiments, then do dvc exp run --run-all -j {num_jobs}, but my whole pipeline is just the one command: qsub.

Is this right?

If I go this route, is there a nice way to merge stuff together? Like copying the git/ref/exp folder and the .dvc/tmp/exp folder?

Yes, if you are able to determine all experiments you want to run before submitting, you can invert your workflow and have dvc call qsub instead. Whether you submit your whole pipeline as one qsub job or split each stage into its own qsub job is up to you.

Hello again. This is the solution I ended up trying, but I ran into a different issue when trying to dvc exp run multiple experiments in parallel. It has trouble cleaning up because some git directory isn’t empty. It sounds exactly like this issue, and I posted debug output to this issue: