I have a pipeline with many parallel stages. I use a combination of a (generated) params.yaml file and the foreach approach. See below for simplified examples of dvc.yaml, params.yaml, and dvc.lock.
I wrote a simple scheduler in Python to run the stages with dvc repro in parallel using multiprocessing. Essentially, it's running dvc repro setup@sim1, dvc repro setup@sim2, etc. in parallel.
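The scheduler itself isn't shown in the post, so the following is only a minimal sketch of the kind of thing it does, not the actual code (it assumes subprocess plus multiprocessing, with stage names matching the dvc.yaml below):

import subprocess
from multiprocessing import Pool

# Hypothetical list of foreach stage instances to reproduce in parallel.
STAGES = ["setup@sim1", "setup@sim2"]

def repro(stage):
    """Run `dvc repro` for one stage and return its exit code."""
    return subprocess.run(["dvc", "repro", stage]).returncode

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        # Each worker spawns its own `dvc repro <stage>` process.
        print(dict(zip(STAGES, pool.map(repro, STAGES))))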
From what I understand this should work. However, I’m getting lock errors: “ERROR: Unable to acquire lock. Most likely another DVC process is running or was terminated abruptly.”
dvc.yaml:
stages:
  setup:
    foreach: ${sims}
    do:
      cmd: python my_script.py --simlabel ${item.label}
      deps:
        - my_script.py
      outs:
        - ${item.work_dir}/outputs.txt
params.yaml:
sims:
  sim1:
    label: label1
    parameters:
      param1: 1
      param2: 2
    work_dir: label1_workdir
  sim2:
    label: label2
    parameters:
      param1: 1
      param2: 2
    work_dir: label2_workdir
dvc.lock:
schema: '2.0'
stages:
  setup@sim1:
    cmd: python my_script.py --simlabel label1
    deps:
    - path: my_script.py
      md5: 614d9f23b42a36de65981aa76026600e
      size: 17401
    outs:
    - path: label1_workdir/outputs.txt
      md5: eaf7cdf7540bb96a6307394ad421ad7b
      size: 4412
  setup@sim2:
    cmd: python my_script.py --simlabel label2
    deps:
    - path: my_script.py
      md5: 614d9f23b42a36de65981aa76026600e
      size: 17401
    outs:
    - path: label2_workdir/outputs.txt
      md5: eaf7cdf7540bb96a6307394ad421ad7b
      size: 4412
Hi @maartenb, sorry for the late response.
The problem here is that both stages (while different due to the parameters they use) share the same dependency, my_script.py: DVC locks it while a stage runs, and hence you get the error. So this is a slightly different use case from the one you linked in your post, where the pipelines are completely separate.
I'm afraid there is no straightforward workaround for that, besides creating several copies of the script and naming them differently, though that would probably not be very elegant.
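For illustration only, that workaround might look something like this (a hypothetical variant of the dvc.yaml above, assuming one copy of the script per simulation, e.g. my_script_label1.py and my_script_label2.py):

stages:
  setup:
    foreach: ${sims}
    do:
      # each stage depends on its own copy of the script, so the parallel
      # repro processes no longer contend for a shared dependency
      cmd: python my_script_${item.label}.py --simlabel ${item.label}
      deps:
        - my_script_${item.label}.py
      outs:
        - ${item.work_dir}/outputs.txt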
That said, you might be interested in the dvc exp command, which should allow you to achieve your goal. You could get parallel execution with the following commands:
dvc exp run setup@sim1 --queue
dvc exp run setup@sim2 --queue
dvc exp run --run-all -j 2
Please read up on dvc exp and share your thoughts on whether this is something you might find useful.
I want to run several training pipelines in parallel, but they all depend on the same data. This gives me the same error: the data file is locked by the other process. If I run each pipeline as an experiment, will it be easy to merge them back together?
Posted in Discord:
Hello, I want to run multiple DVC pipelines in parallel which share a dependency. Specifically, these are training pipelines which share the data dependency and some code dependencies, but have separate output directories. When I try to run dvc repro <pipeline> twice from separate processes, the second one says that the dependency stage is locked:
Stage 'pipelines/data/dvc.yaml:prepare_dataset@H' didn't change, skipping
ERROR: failed to reproduce 'pipelines/data/dvc.yaml:make_folds@H': '_____/data/folds/H' is busy, it is being blocked by:
(PID 317937): _____dvc repro pipelines/twod/dvc.yaml
If there are no processes with such PIDs, you can manually remove '________/.dvc/tmp/rwlock' and try again.
How can I get around this? It seems like one possibility would be running the pipelines as experiments, but would it be easy to merge the experiment results back together? I think it should be, because the pipelines have separate dvc.yaml and dvc.lock files, so I could just dvc exp apply each of the runs individually. Any advice?
For the record: responded in Discord
Hello,
I am running the stages as experiments and they are running in parallel, but they are each reproducing the common dependency.
Here is a sketch of my DAG:
I manually ran train model 1 / evaluate model 1, which made sure prepare dataset was up to date. I verified this with dvc status. Then I queued up the other two pipelines with:
dvc exp run --queue pipelines/model_2/dvc.yaml
dvc exp run --queue pipelines/model_3/dvc.yaml
But then when I run the two of them with dvc exp run --run-all -j 2, they both reproduce prepare dataset. prepare dataset has a directory output like remote://records/${item.p1}/${item.p2}, where item comes from a matrix stage. The model training stages depend on remote://records/${var}/${item}, where var is defined in the vars section and item comes from the foreach stage.
Any ideas how to keep the training stages from reproducing this dependency, which should already be up to date?
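Roughly, the layout looks like this (stage names, matrix values, and commands below are placeholders for illustration, not my actual files):

pipelines/data/dvc.yaml (the matrix stage that produces the shared records):
stages:
  prepare_dataset:
    matrix:
      p1: [a, b]
      p2: [x, y]
    cmd: python prepare.py --p1 ${item.p1} --p2 ${item.p2}
    outs:
      - remote://records/${item.p1}/${item.p2}

pipelines/model_2/dvc.yaml (one of the training pipelines that consumes them):
vars:
  - p1: a
stages:
  train:
    foreach: [x, y]
    do:
      cmd: python train.py --p1 ${p1} --fold ${item}
      deps:
        - remote://records/${p1}/${item}
      outs:
        - models/model_2/${item}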
Hm, it doesn't sound right that prepare dataset gets triggered if it's already up to date. Is it possible to produce a simple reproducible example of this happening?
I think this was actually an issue on my end; I had an extra dependency in one of my stages. Thanks.