Multiple dvc runs in parallel

Hello,

I am trying to run multiple dvc experiments simultaneously on a GPU cluster. The way the cluster works is that you send a bash script to the cluster and it runs the job when it's your turn. In my bash script, I configure CUDA_VISIBLE_DEVICES, activate my conda environment, and run the experiment using dvc exp run --temp. Before I submit my job, I make the changes I want for my experiment, e.g. set the variables in params.yaml. I am trying to send a job, wait for it to start running, then make changes and send another. The second job fails with this error:

ERROR: Unable to acquire lock. Most likely another DVC process is running or was terminated abruptly. Check the page <https://dvc.org/doc/user-guide/troubleshooting#lock-issue> for other possible reasons and to learn how to resolve this.

I was hoping that using --temp would allow me to run these jobs at the same time, because it isolates the project in a separate directory. One thing I should note is that my cache dir is in a separate directory outside my project directory, because I am not allowed to use that much space in my user directory.

My question is related to this one: https://discuss.dvc.org/t/lock-error-with-parallelized-dvc-repro/929/2 in which the suggested approach was to use --queue and -j to run the experiments in parallel. How can I use the built-in parallel experiment execution but run the experiments in separate job submissions to the cluster? Or what is the intended workflow for this sort of situation?

Thanks!

We don’t have a recommended workflow for this situation per se, though we can offer some suggestions on how you could approach it. One way could be to provide the GPU via params: schedule only as many experiments as you have available GPUs, and give each one a different GPU id. That will require some legwork if you have more experiments than GPUs, because you will have to schedule experiments → train → schedule experiments → train, and so on, until all the experiments have run.
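As a rough sketch (assuming a params.yaml key called gpu, a placeholder name that your training code would read to pick the device):

# queue one experiment per available GPU, each pinned to a different id
dvc exp run --queue -S gpu=0
dvc exp run --queue -S gpu=1
dvc exp run --queue -S gpu=2

# run the batch in parallel, then queue and run the next batch, and so on
dvc exp run --run-all --jobs 3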

An alternative approach, if you have control over the training server, is to figure out a way to “automatically” reserve a GPU. One way to do that would be to use a common directory whose files indicate which GPUs are reserved and which are free.
For example, the following code:

import os
import time
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager

import zc.lockfile

GPU_IDS = [0, 1, 2]

LOCK_DIR = "locks"
os.makedirs(LOCK_DIR, exist_ok=True)

def lock(gpu_id):
    # one lock file per GPU, e.g. locks/0, locks/1, ...
    path = os.path.join(LOCK_DIR, str(gpu_id))
    return zc.lockfile.LockFile(path)

@contextmanager
def lock_free_gpu(retry_seconds=5):
    # try each GPU in turn; if all are currently locked, wait and retry
    while True:
        for gpu_id in GPU_IDS:
            try:
                gpu_lock = lock(gpu_id)
            except zc.lockfile.LockError:
                print(f"Can't lock '{gpu_id}', trying another one")
                continue
            try:
                yield gpu_id
            finally:
                # release the GPU even if the task fails
                gpu_lock.close()
            return
        time.sleep(retry_seconds)

def training_task(task_number):
    with lock_free_gpu() as gpu_id:
        print(f"Working on task: '{task_number}', gpu: '{gpu_id}'")
        # simulate different execution times
        time.sleep(gpu_id * 5 + 5)

with ThreadPoolExecutor(max_workers=3) as executor:
    for i in range(20):
        executor.submit(training_task, i)

This “reserves” a GPU by locking the corresponding file (whose path is locks/{GPU-id}), so that other runs know not to use it. The approach only makes sense if all users of the server go through the “locking” context manager, and you also need to make sure to run the experiments with --jobs <= number_of_gpus.
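With that convention in place, you can queue more experiments than you have GPUs; which GPU each experiment ends up on is decided at runtime by whichever lock file is free, for example:

# keep the number of parallel workers at or below the number of GPUs
# (3 in the snippet above); the lock files handle the GPU assignment
dvc exp run --run-all --jobs 3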

Thanks for the response!

The problem is that I don’t know which GPU I will get until the cluster assigns me one. The bash script I send to the cluster determines which GPU I have been assigned and sets the environment variable as needed.

Also, would specifying the GPU get me around the lock issue? It seems like dvc has an issue with multiple processes accessing the repo or the data or something like that. I was hoping that isolating the experiments with --temp would get me around this, but it doesn’t seem to be working the way I assumed it would.

Hi @gregstarr! Unfortunately, it’s not yet possible to start new experiments while others are running. You would have to queue them all first and then run them together to take advantage of parallel execution. We are working on addressing this limitation right now, and you can keep track of the progress in exp queue: command for managing queues · Issue #5615 · iterative/dvc · GitHub.

Thanks for the response. Bummer that it is not possible at the moment, but I’m glad to see it’s being worked on!

So, what you can do is queue a few experiments and then dvc exp run --run-all -j {num_jobs}. But in order to do that, you need some magic to determine which GPU each process should take. You say that you have a bash script determining that? Then maybe your stage cmd should look like
./bash_script_to_set_gpu_id.sh && python train.py?
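One detail worth checking: if that script only exports CUDA_VISIBLE_DEVICES, it has to be sourced for the variable to still be set when train.py starts, so the stage cmd would look more like this sketch:

# dvc.yaml stage cmd (sketch): source the GPU-selection script so its
# exported CUDA_VISIBLE_DEVICES carries over to the training process
. ./bash_script_to_set_gpu_id.sh && python train.py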

By the way, there is one hacky workaround available now if you need to run jobs simultaneously. You can make multiple copies of your repo and run one experiment per copy.
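A sketch of that workaround (the clone path and cache location are placeholders; since your cache is already outside the project, the copies can all share it):

# one working copy of the repo per concurrent experiment, all sharing
# the same external cache directory
git clone . ../exp-copy-1
cd ../exp-copy-1
dvc config --local cache.dir /path/to/shared/cache
dvc checkout        # restore tracked data from the shared cache
dvc exp run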

Ok I think I understand. Thanks for the clarification.

The current workflow is to simply submit my bash script to the cluster job manager. The bash script looks something like this:

  1. misc. configuration
  2. name={read the file that says which gpu I am assigned}
  3. export CUDA_VISIBLE_DEVICES="${name}"
  4. activate conda env
  5. dvc exp run --temp

Then I submit the job with a command: qsub qsub_script.sh. This submits the job to the job manager and returns.
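Roughly, qsub_script.sh looks like this (file names and the GPU-assignment mechanism are simplified):

#!/bin/bash
# 1. misc. configuration
# ...

# 2-3. read the assigned GPU and expose it to the training process
name=$(cat assigned_gpu.txt)        # stand-in for however the cluster reports the GPU
export CUDA_VISIBLE_DEVICES="${name}"

# 4. activate the conda environment (placeholder env name)
conda activate my_env

# 5. run the experiment in an isolated temp directory
dvc exp run --temp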

It sounds like what I need to do instead is have my command submit the job? i.e. I make my changes, queue up my experiments, then do dvc exp run --run-all -j {num_jobs}, but my whole pipeline is just the one command qsub qsub_script.sh?

Is this right?

If I go this route, is there a nice way to merge stuff together? Like copying the git/ref/exp folder and the .dvc/tmp/exp folder?

Yes, if you are able to determine all experiments you want to run before submitting, you can invert your workflow and have dvc call qsub instead. Whether you submit your whole pipeline as one qsub job or split each stage into its own qsub job is up to you.
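For example, the stage command could submit a blocking cluster job so that dvc can pick up the outputs when it finishes (a sketch; the exact flag depends on your scheduler, and qsub_train.sh is a hypothetical script that runs python train.py directly rather than dvc exp run):

# dvc.yaml stage cmd (sketch): a *blocking* submission, e.g. `-sync y` on
# Grid Engine or `sbatch --wait` on Slurm, so the stage only finishes when
# the cluster job does
qsub -sync y qsub_train.sh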

Hello again. This is the solution I ended up trying, but I ran into a different issue when trying to dvc exp run multiple experiments in parallel. It has trouble cleaning up because some git directory isn’t empty. It sounds exactly like this issue, and I posted debug output to it:

Thanks