Running multiple DVC pipelines in parallel

Hello,

I have been using DVC to manage an ML pipeline for the last several days and have had a pretty good experience. I especially like the way the outputs are cached and the dependencies are managed.

Now I am trying to run multiple pipelines simultaneously. This is useful, for example, to test several different parameter values and make use of the multiple CPUs/GPUs available. The runs are completely independent (both deps and outs are different), so there is no risk of overwriting results.

However, dvc repro does not allow running more than a single pipeline at a time. I managed to get around this by removing the .dvc/lock file, but this causes the following error in the pipeline:

ERROR: cannot perform the command because another DVC process seems to be running on this project. If that is not the case, manually remove .dvc/lock and try again.

Is there a better way to go around this limitation?

Hello @btel. I don’t think there is a workaround that allows you to run multiple instances in parallel. However, @kupruser is currently working on this functionality. For some reason Discourse prevents me from attaching a GitHub link, so you need to build it manually: /iterative/dvc/pull/2584. There you can see the progress of this task.


Here is the full link (let’s see if it works for me ;-)):

Thanks for the quick reply! It’s exactly what I am looking for. Let me know if there is any way I could help with it (like testing ;-)).


Sure! We would love to get some feedback as soon as it is released. I will ping you when it’s done. :slight_smile:


Hi,

Perhaps the only current workaround would be to make a copy of the project somewhere else and run it from there. Not ideal, of course, but with proper Git branching, data duplication could be avoided. Obviously, the feature request is the real solution, but I just wanted to mention this idea :slight_smile:
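In case it helps, here is a rough sketch of that idea in Python. It is only an illustration, not something DVC provides: the function name run_in_copies, its parameters, and the target names in the usage note are all made up. It makes one plain copy of the project per target and launches the command (dvc repro by default) inside each copy, so every process sees its own .dvc directory. Note that a plain copy duplicates everything in the project folder, including any local cache, so it does not address the data duplication point.

```python
import shutil
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_in_copies(project_dir, targets, command=("dvc", "repro"), workers=2):
    """Copy the project once per target and run `command target` in each copy.

    Hypothetical helper: the name and parameters are illustrative only.
    Returns a list of (target, copy_dir, returncode) in the order of `targets`.
    """
    project_dir = Path(project_dir)
    # All copies live under one scratch directory outside the original project.
    scratch = Path(tempfile.mkdtemp(prefix="dvc_parallel_"))

    def run_one(indexed_target):
        index, target = indexed_target
        copy_dir = scratch / f"copy_{index}"
        # Plain recursive copy; this duplicates every file in the project.
        shutil.copytree(project_dir, copy_dir)
        result = subprocess.run(
            [*command, target],
            cwd=copy_dir,
            capture_output=True,
            text=True,
        )
        return target, copy_dir, result.returncode

    # Threads are enough here because the real work happens in subprocesses.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, enumerate(targets)))
```

You would call it with something like run_in_copies(".", ["train_a.dvc", "train_b.dvc"]), where the two stage file names are placeholders, and then collect the outputs from each returned copy directory yourself.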

Best


Thanks for your helpful ideas. For the moment, I simply remove the lock file manually. I only run stages in parallel that do not share outputs and do not have mutual dependencies. This corrupts the sqlite database from time to time, but it’s quick to regenerate, as long as you keep calm. :relaxed:
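To make the rule I follow a bit more concrete, here is a minimal sketch of the check in Python. It does not read DVC stage files: the stages dictionary with its deps and outs lists is a hand-written, hypothetical description of the stages, and find_conflicts is a made-up helper name. It only compares exact paths and direct dependencies, so chains through a third stage or nested directories would need extra handling.

```python
from itertools import combinations


def find_conflicts(stages):
    """Return (stage_a, stage_b, paths) for pairs that should not run together.

    `stages` is a hypothetical mapping: name -> {"deps": [...], "outs": [...]}.
    Two stages conflict if they write the same output, or if one of them
    reads a path that the other one writes.
    """
    conflicts = []
    for (name_a, a), (name_b, b) in combinations(stages.items(), 2):
        outs_a, outs_b = set(a["outs"]), set(b["outs"])
        deps_a, deps_b = set(a["deps"]), set(b["deps"])
        # Both stages would write the same path.
        shared_outs = outs_a & outs_b
        # One stage consumes what the other produces (direct dependency only).
        chained = (outs_a & deps_b) | (outs_b & deps_a)
        if shared_outs or chained:
            conflicts.append((name_a, name_b, sorted(shared_outs | chained)))
    return conflicts
```

Stages that only share an input (for example, the same raw data file) are not reported, since reading the same dependency does not overwrite anything.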

However, your trick of having multiple directories is very useful when I want to run stages and investigate the results of previous runs at the same time.

In general, my experience with DVC has been very good so far. It nicely streamlines the management of dependencies between stages and the version control of large files. It also integrates superbly with Git. Thanks for the great work! :hugs:

I am now only waiting for the PR to get merged! :ok_hand:
