Lock error with parallelized dvc repro

I have a pipeline with many parallel stages. I use a combination of a (generated) params.yaml file and the foreach approach. See below for simplified examples of dvc.yaml, params.yaml, and dvc.lock.

I wrote a simple scheduler in Python to run the stages with dvc repro in parallel using multiprocessing. Essentially, it's running dvc repro setup@sim1, dvc repro setup@sim2, etc. in parallel.
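
A simplified sketch of what the scheduler does (the stage names are hard-coded here for illustration; the real list is generated):

import subprocess
from multiprocessing import Pool

# Stage names as they appear after the foreach expansion in dvc.yaml.
STAGES = ["setup@sim1", "setup@sim2"]

def repro(stage):
    # Each worker runs `dvc repro` for a single stage in its own process.
    subprocess.run(["dvc", "repro", stage], check=True)

if __name__ == "__main__":
    with Pool(processes=len(STAGES)) as pool:
        pool.map(repro, STAGES)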

From what I understand, this should work. However, I'm getting lock errors: “ERROR: Unable to acquire lock. Most likely another DVC process is running or was terminated abruptly.”

dvc.yaml:

stages:
  setup:
    foreach: ${sims}
    do:
      cmd: python my_script.py --simlabel ${item.label}
      deps:
        - my_script.py
      outs:
        - ${item.work_dir}/outputs.txt

params.yaml:

sims:
  sim1:
    label: label1
    parameters:
      param1: 1
      param2: 2
    work_dir: label1_workdir
  sim2:
    label: label2
    parameters:
      param1: 1
      param2: 2
    work_dir: label2_workdir

dvc.lock:

schema: '2.0'
stages:
  setup@sim1:
    cmd: python my_script.py --simlabel label1
    deps:
    - path: my_script.py
      md5: 614d9f23b42a36de65981aa76026600e
      size: 17401
    outs:
    - path: label1_workdir/outputs.txt
      md5: eaf7cdf7540bb96a6307394ad421ad7b
      size: 4412
  setup@sim2:
    cmd: python my_script.py --simlabel label2
    deps:
    - path: my_script.py
      md5: 614d9f23b42a36de65981aa76026600e
      size: 17401
    outs:
    - path: label2_workdir/outputs.txt
      md5: eaf7cdf7540bb96a6307394ad421ad7b
      size: 4412

Hi @maartenb, sorry for the late response.
So the problem here is that both stages (while different due to the parameters they use) share the same dependency:

  • my_script.py

DVC is locking it, hence the error. So this is a slightly different use case from the one you linked in your post, where the pipelines are completely separate.
I am afraid there is no straightforward workaround for that, besides creating several copies of the script and naming them differently, though that would not be very elegant.

Having said that, you might be interested in the dvc exp command, which should allow you to achieve your goal.

You could get parallel execution with the following commands:

dvc exp run setup@sim1 --queue
dvc exp run setup@sim2 --queue
dvc exp run --run-all -j 2

Please read up on dvc exp and share your thoughts on whether this is something you might find useful.

I want to run several training pipelines in parallel, but they all depend on the same data. This is giving me the same error: the data file is locked by the other process. If I run each pipeline as an experiment, will it be easy to merge them back together?

Posted in Discord:

Hello, I want to run multiple DVC pipelines in parallel that share a dependency. Specifically, these are training pipelines that share a data dependency and some code dependencies, but have separate output directories. When I try to run dvc repro <pipeline> twice from separate processes, the second one says that the dependency stage is locked:

Stage 'pipelines/data/dvc.yaml:prepare_dataset@H' didn't change, skipping
ERROR: failed to reproduce 'pipelines/data/dvc.yaml:make_folds@H': '_____/data/folds/H' is busy, it is being blocked by:
  (PID 317937): _____dvc repro pipelines/twod/dvc.yaml

If there are no processes with such PIDs, you can manually remove '________/.dvc/tmp/rwlock' and try again.

How can I get around this? It seems like one possibility would be running the pipelines as experiments, but would it be easy to merge the experiment results back together? I think it should be, because the pipelines have separate dvc.yaml and dvc.lock files, so I could just dvc exp apply each of the runs individually. Any advice?
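
Something along these lines is what I have in mind (the experiment names are just placeholders for whatever dvc exp show reports):

dvc exp show            # list the finished runs and their generated names
dvc exp apply exp-1111  # apply the first run's results to the workspace
dvc exp apply exp-2222  # then the second, since its dvc.lock and outputs are separate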

For the record: responded in Discord

Hello,

I am running the stages as experiments and they are running in parallel, but they are each reproducing the common dependency.

Here is a sketch of my DAG:

I manually ran train model 1 / evaluate model 1, which made sure prepare dataset was up to date; I verified this with dvc status. Then I queued up the other two pipelines with:

dvc exp run --queue pipelines/model_2/dvc.yaml
dvc exp run --queue pipelines/model_3/dvc.yaml

But when I run both of them with dvc exp run --run-all -j 2, they each reproduce prepare dataset. prepare dataset has a directory output like remote://records/${item.p1}/${item.p2}, where item comes from a matrix stage. The model training stages depend on remote://records/${var}/${item}, where var is defined in the vars section and item comes from the foreach stage.
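
For reference, here is a simplified sketch of the relevant stages (the script names, parameter values, and stage names other than prepare_dataset are placeholders, not my actual config):

# pipelines/data/dvc.yaml (simplified)
stages:
  prepare_dataset:
    matrix:
      p1: [records_v1]
      p2: [H]
    cmd: python prepare.py --p1 ${item.p1} --p2 ${item.p2}
    outs:
      - remote://records/${item.p1}/${item.p2}

# pipelines/model_2/dvc.yaml (simplified)
vars:
  - var: records_v1
stages:
  train:
    foreach: [H]
    do:
      cmd: python train.py --fold ${item}
      deps:
        - remote://records/${var}/${item}
      outs:
        - model_2/${item}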

Any ideas on how to keep the training stages from reproducing a dependency that should already be up to date?

Hm, it doesn’t sound right that prepare dataset gets triggered if it’s already up to date. Is it possible to produce a simple reproducible example of this happening?

I think this was actually an issue on my end; I had an extra dependency in one of my stages. Thanks