Launch a script which loops over multiple folders and does some processing (each folder takes quite a bit of CPU time)
Each folder leads to the creation of a new folder, containing the output of the processing
Once the loop is complete, a single .dvc file is created
This works fine, but if the processing crashes mid-way, dvc repro will start again from scratch.
One solution would be to create a .dvc file for each folder, but then repro becomes tedious to run, as I would need to loop over all the .dvc files.
Is there a better way to go? If not, would it be possible to define multiple layers of .dvc files (e.g. run subtasks, create multiple .dvc files, and once the subtasks are complete, create a master .dvc file that makes it easy to repro all the subtasks)?
Very interesting scenario! I suppose your script is able to pick up from where it crashed? dvc repro removes the outputs of a stage before reproducing it, to ensure that the output was indeed fully produced by the command and can be reproduced from the specified dependencies in the future. If your script is able to pick up from where it failed and you really don’t want to run it all over again, I would propose something like this:
First, create .dvc file for each directory:
$ dvc run -d idir1 -o odir1 script.py
...
$ dvc run -d idirN -o odirN script.py
And then tie them together with a dummy master Dvcfile:
$ dvc run -d odir1 ... -d odirN -f Dvcfile
And now to reproduce all of them, you could simply run
$ dvc repro
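Scripted out, that trick might look like the following dry run (a sketch only: the directory names and script.py are placeholders from the commands above, and the echoes just print the dvc commands instead of executing them):

```shell
#!/bin/bash
# Sketch of the trick above as a dry run: one stage per input directory,
# then a dummy master Dvcfile depending on all the outputs.
# Drop the echoes to actually execute the dvc commands.
set -e
DEPS=""
for d in idir1 idir2 idir3; do
  echo dvc run -d "$d" -o "o${d#i}" script.py   # idir1 -> odir1, etc.
  DEPS="$DEPS -d o${d#i}"
done
echo dvc run $DEPS -f Dvcfile   # dummy master stage tying everything together
echo dvc repro                  # reproduces the whole pipeline in one go
```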
Would this trick suit your scenario?
There is also the possibility of introducing an option for dvc repro that would not remove outputs before running the command (e.g. --no-cleanup or something like that). That option seems a bit dangerous, but it might be required in scenarios where you can’t split the stage into substages. Would something like this suit your scenario better?
I suppose your script is able to pick up from where it crashed?
Yes, it is. But I think it is better to leave the resume logic up to DVC (say we have 10 dvc run commands and there is a crash at command 5; then I would automatically call dvc repro for the first 5 and dvc run for the last 5).
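For reference, such a resumable processing loop can be sketched like this (an illustration only, with made-up folder names; cp -a stands in for the real, CPU-heavy processing step):

```shell
#!/bin/bash
# Minimal sketch of a processing loop that can pick up where it crashed:
# a folder is skipped whenever its output directory already exists, so a
# re-run only redoes the folders that were not finished.
set -e
mkdir -p idir1 idir2 idir3
for d in idir1 idir2 idir3; do
  out="o${d#i}"                 # idir1 -> odir1, etc.
  if [ -d "$out" ]; then
    echo "skipping $d: output already present"
    continue
  fi
  cp -a "$d" "$out"             # the actual processing would go here
done
```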
Would this trick suit your scenario?
I will try and let you know, but from the looks of it, it definitely should.
After a bit more thinking, I am not sure that tying the outputs of all the subfolders into a new stage really helps solve the problem.
The command:
dvc run -d odir1 … -d odirN -f Dvcfile
will only work if each odir has already been created.
This means that before I run this command, I must run
for folder in list_:
    dvc run -d folder -o odir myscript.py
In that case, I see no clear advantage to running dvc run -d odir1 ... -d odirN -f Dvcfile for repro compared to re-using the for loop with:
for folder in list_:
    dvc repro odir.dvc
(where odir.dvc is the stage file created by the corresponding dvc run).
However, if there were a way for DVC to understand all the dependencies of a stage, so that calling
dvc run -d odir1 ... -d odirN -f Dvcfile
would automatically call dvc run -d folder -o odir myscript.py for each folder in the list, then that would be quite useful!
Ah, right. That is because dvc run checks that the dependencies exist first. I think in the --no-exec case it shouldn’t do that. This is a regression introduced in 0.18.15. Created https://github.com/iterative/dvc/issues/1113 to track the progress on it. I’ll send a patch soon and will release it as either 0.18.16 or 0.19 by the end of the week. In the meantime, you can downgrade to 0.18.14 and give it a try.
I was running something slightly different (i.e. not using --no-exec for the dummy master Dvcfile):
#!/bin/bash
set -e
set -x
rm -rf myrepo
mkdir myrepo
cd myrepo
git init
dvc init
git commit -s -m"Init"
for d in {1..3}; do
  mkdir idir$d
  echo $d > idir$d/file
  dvc run --no-exec -d idir$d -o odir$d cp -a idir$d odir$d
  DEPS="$DEPS -d odir$d"
done
dvc run $DEPS -f Dvcfile
Error: Failed to run command: missing dependencies: odir1, odir2, odir3
Also, the script below fails, which may or may not be intended behaviour:
#!/bin/bash
set -e
set -x
rm -rf myrepo
mkdir myrepo
cd myrepo
git init
dvc init
git commit -s -m"Init"
for d in {1..3}; do
  mkdir idir$d
  echo $d > idir$d/file
  dvc run --no-exec -d idir$d -o odir$d cp -a idir$d odir$d
  DEPS="$DEPS -d odir$d"
done
dvc run --no-exec $DEPS -f Dvcfile
dvc repro --dry
Error: Failed to reproduce ‘Dvcfile’: Failed to reproduce ‘Dvcfile’: missing dependencies: odir1, odir2, odir3
In light of the above, I feel that running through my whole pipeline with --no-exec and then calling dvc repro would be the best practice. I would very much appreciate any suggestions to keep on improving the workflow, though.
I was running something slightly different (i.e. not using --no-exec for the dummy master Dvcfile):
...
dvc run $DEPS -f Dvcfile
That is because dvc run only runs the specified command and doesn’t run dependency stages; that is the job of dvc repro. In the example above, you first build the pipeline with dvc run --no-exec and then run it with dvc repro.
Also, the script below fails, which may or may not be intended behaviour:
I would very much appreciate any suggestions to keep on improving the workflow though.
It is hard to recommend anything in advance (though we are working on a Best Practices section for our docs), so please feel free to ask any questions and we will do our best to help you.