Creating an aggregate .dvc file

I have the following use case:

  • Launch a script which loops over multiple folders and does some processing (each folder takes quite a bit of CPU time)
  • Each folder leads to the creation of a new folder, containing the output of the processing
  • Once the loop is complete, a single .dvc file is created

This works fine, but if the processing crashes mid-way, dvc repro will start again from scratch.
One solution would be to create a .dvc file for each folder, but then repro becomes tedious to run, as I would need to loop over all the .dvc files.

Is there a better way to go? If not, would it be possible to define multiple layers of .dvc files (e.g. run the subtasks, create one .dvc file per subtask, and once all subtasks are complete, create a master .dvc file that makes it easy to repro all of them)?

Thanks in advance!

Hi @tmain !

Very interesting scenario! I suppose your script is able to pick up from where it crashed? dvc repro removes the outputs of a stage before reproducing it, to ensure that they are indeed fully produced by the command and can be reproduced from the specified dependencies in the future. If your script is able to pick up from where it failed and you really don’t want to have it running all over again, I would propose something like this:

First, create a .dvc file for each directory:

$ dvc run -d idir1 -o odir1 script.py
...
$ dvc run -d idirN -o odirN script.py

And then tie them together with a dummy master Dvcfile:

$ dvc run -d odir1 ... -d odirN -f Dvcfile

And now to reproduce all of them, you could simply run

$ dvc repro

Would this trick suit your scenario?

There is also the possibility of introducing an option for dvc repro that does not remove outputs before running the command (e.g. --no-cleanup or something like that). Such an option seems a bit dangerous, but it might be required in scenarios where you cannot split the stage into substages. Would something like this suit your scenario better?

Thanks,
Ruslan

I suppose your script is able to pick up from where it crashed?

Yes it is. But I think it is better to leave the resume up to dvc (let’s say we have 10 dvc run commands and there is a crash at command 5; then I will automatically call dvc repro for the first 5, and dvc run for the last 5).
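
The resume logic described above could be sketched roughly as follows. This is only an illustration: the helper name run_or_repro is made up, and it assumes that the stage file for an output odirN is named odirN.dvc and that it only exists once its dvc run has completed.

```shell
#!/bin/bash
# Sketch only: per folder, reproduce an existing stage or create a new one.
# Assumption: the stage file for output "odirN" is "odirN.dvc" and is only
# present if an earlier `dvc run` for that folder finished.

run_or_repro() {
    local idir="$1" odir="$2"
    if [ -f "$odir.dvc" ]; then
        # Stage already defined by an earlier run: let dvc decide
        # whether anything needs to be redone.
        dvc repro "$odir.dvc"
    else
        # Stage not defined yet: create it and run the command.
        dvc run -d "$idir" -o "$odir" myscript.py
    fi
}
```

It would be called once per folder, e.g. for n in 1 2 3; do run_or_repro "idir$n" "odir$n"; done.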

Would this trick suit your scenario?

I will try and let you know, but from the looks of it, it definitely should.

After a bit more thinking, I am not sure whether tying the outputs of all subfolders in a new stage is really helping to solve the problem.

The command:

dvc run -d odir1 ... -d odirN -f Dvcfile

will only work if each odir has already been created.
This means that before I run this command, I must run

for folder in list_:
    dvc run -d folder -o odir myscript.py

In that case, I see no clear advantage to running dvc run -d odir1 ... -d odirN -f Dvcfile for repro compared to re-using the for loop with:

for folder in list_:
    dvc repro odir.dvc

However, if there was a way for DVC to understand all the dependencies of a stage, so that calling

dvc run -d odir1 ... -d odirN -f Dvcfile

would automatically call dvc run -d folder -o odir myscript.py for each folder in the list, then that would be quite useful!

You can also use the --no-exec option of dvc run to build the pipeline first, and then run dvc repro:

for folder in list_:
    dvc run --no-exec -d folder -o odir myscript.py

dvc run --no-exec -d odir1 ... -d odirN -f Dvcfile

dvc repro

Please let us know if this works for you.

Thanks,
Ruslan

This does not work when the odir folders are created by myscript (Error: missing dependency).

Ah, right. That is because dvc run checks that the dependency exists first. I think it shouldn’t do that in the --no-exec case. This is a regression introduced in 0.18.15. I created https://github.com/iterative/dvc/issues/1113 to track the progress on it. I’ll send a patch soon and will release it either as 0.18.16 or 0.19 by the end of the week. In the meantime, you can try downgrading to 0.18.14.

Thanks,
Ruslan

Hi @tmain !

I took a closer look and it turned out I was wrong: --no-exec works as expected. Here is the script that I’m using:

#!/bin/bash

set -e
set -x

rm -rf myrepo
mkdir myrepo
cd myrepo

git init
dvc init
git commit -s -m"Init"

for d in {1..3}; do
    mkdir idir$d
    echo $d > idir$d/file
    dvc run --no-exec -d idir$d -o odir$d cp idir$d odir$d -a
    DEPS="$DEPS -d odir$d"
done

dvc run --no-exec $DEPS -f Dvcfile

dvc repro
# Reproducing 'odir2.dvc'
# Running command:
#         cp idir2 odir2 -a
# Adding 'odir2' to '.gitignore'.
# Saving 'odir2' to cache '.dvc/cache'.
# Created 'hardlink': .dvc/cache/26/ab0db90d72e28ad0ba1e22ee510510 -> odir2/file
# Saving information to 'odir2.dvc'.
# Reproducing 'odir3.dvc'
# Running command:
#         cp idir3 odir3 -a
# Adding 'odir3' to '.gitignore'.
# Saving 'odir3' to cache '.dvc/cache'.
# Created 'hardlink': .dvc/cache/6d/7fce9fee471194aa8b5b6e47267f03 -> odir3/file
# Saving information to 'odir3.dvc'.
# Reproducing 'odir1.dvc'
# Running command:
#         cp idir1 odir1 -a
# Adding 'odir1' to '.gitignore'.
# Saving 'odir1' to cache '.dvc/cache'.
# Created 'hardlink': .dvc/cache/b0/26324c6904b2a9cb4b88d6d61c81d1 -> odir1/file
# Saving information to 'odir1.dvc'.
# Reproducing 'Dvcfile'
# Running command:
#
# Saving information to 'Dvcfile'.

dvc pipeline show --ascii
# .-----------.           .-----------.           .-----------.
# | odir2.dvc |           | odir3.dvc |           | odir1.dvc |
# `-----------'****       `-----------'        ***`-----------'
#                  ****          *         ****
#                      ****      *     ****
#                          ***   *   **
#                          .---------.
#                          | Dvcfile |
#                          `---------'

which works perfectly for me. Could you please share more details about where you get that error?

Thanks,
Ruslan

It does work for me as well!

I was running something slightly different (i.e. not using --no-exec for the dummy master Dvcfile):

#!/bin/bash

set -e
set -x

rm -rf myrepo
mkdir myrepo
cd myrepo

git init
dvc init
git commit -s -m"Init"

for d in {1..3}; do
    mkdir idir$d
    echo $d > idir$d/file
    dvc run --no-exec -d idir$d -o odir$d cp idir$d odir$d -a
    DEPS="$DEPS -d odir$d"
done

dvc run $DEPS -f Dvcfile

Error: Failed to run command: missing dependencies: odir1, odir2, odir3
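
The error above comes from dvc run (without --no-exec) expecting every -d path to exist already. A small pre-flight check along these lines can make that visible before calling dvc run. It is a plain shell sketch, not a DVC feature, and the helper name missing_deps is made up for illustration.

```shell
#!/bin/bash
# Sketch only (not part of DVC): report which dependency paths do not exist
# yet, i.e. which ones `dvc run` without --no-exec would complain about.

missing_deps() {
    local p missing=()
    for p in "$@"; do
        # A dependency may be a file or a directory.
        [ -e "$p" ] || missing+=("$p")
    done
    # Print the missing paths comma-separated; print nothing if all exist.
    local IFS=,
    printf '%s' "${missing[*]}"
}
```

For example, m=$(missing_deps odir1 odir2 odir3) leaves m empty only when all three directories exist.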

Also, the script below fails, which may or may not be intended behaviour:

#!/bin/bash

set -e
set -x

rm -rf myrepo
mkdir myrepo
cd myrepo

git init
dvc init
git commit -s -m"Init"

for d in {1..3}; do
    mkdir idir$d
    echo $d > idir$d/file
    dvc run --no-exec -d idir$d -o odir$d cp idir$d odir$d -a
    DEPS="$DEPS -d odir$d"
done

dvc run --no-exec $DEPS -f Dvcfile

dvc repro --dry

Error: Failed to reproduce 'Dvcfile': Failed to reproduce 'Dvcfile': missing dependencies: odir1, odir2, odir3

In light of the above, I feel like running through my whole pipeline with --no-exec and then calling dvc repro would be the best practice. I would very much appreciate any suggestions to keep on improving the workflow though.

Hi @tmain !

I was running something slightly different (i.e. not using --no-exec for the dummy master Dvcfile):

...
dvc run $DEPS -f Dvcfile 

That is because dvc run only runs the specified command and does not run the stages it depends on; that is the job of dvc repro. In the working example above, you first build the pipeline with dvc run --no-exec and then run it with dvc repro.
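
As a rough sketch of that two-phase workflow, the pipeline declaration can be wrapped in a function that collects the -d flags in a bash array instead of a string. The function name build_pipeline is made up, and the idirN/odirN names and the cp command simply follow the scripts in this thread.

```shell
#!/bin/bash
# Sketch only: declare every stage with --no-exec, then let dvc repro run them.
# Directory names (idirN/odirN) and the cp command follow this thread's examples.

build_pipeline() {
    local deps=() n
    for n in "$@"; do
        # Declare one stage per folder without running its command.
        dvc run --no-exec -d "idir$n" -o "odir$n" cp "idir$n" "odir$n" -a
        deps+=(-d "odir$n")
    done
    # Declare the aggregate stage that depends on every output directory.
    dvc run --no-exec "${deps[@]}" -f Dvcfile
}
```

A typical call would be build_pipeline 1 2 3 && dvc repro, so that nothing is executed until dvc repro walks the dependency graph.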

Also, the script below fails, which may or may not be intended behaviour:

dvc repro --dry

Good point! Repurposed https://github.com/iterative/dvc/issues/1113 (now titled "repro: don't check for missing dependencies when running with `--dry`") to track the progress on it.

I would very much appreciate any suggestions to keep on improving the workflow though.

It is hard to recommend anything in advance (though we are working on a Best Practices section for our docs), so please feel free to ask any questions and we will do our best to help you.

Thank you so much for the feedback!

-Ruslan

Hi,

I have a variant of the use case above, which I can’t seem to get working:

#!/bin/bash

set -e
set -x

rm -rf myrepo
mkdir myrepo
cd myrepo

git init
dvc init
git commit -s -m"Init"

mkdir -p main_data/data/procdata
mkdir -p main_data/data/rawdata
for d in {1..2}; do
    echo $d > main_data/data/rawdata/file$d.txt
done

for d in {1..2}; do
    dvc run --no-exec -c main_data/data/procdata -d ../rawdata/file$d.txt -o file$d.txt cp ../rawdata/file$d.txt .
    DEPS="$DEPS -d file$d.txt"
done

echo $DEPS
dvc run --no-exec -c main_data/data $DEPS -o procdata

dvc repro main_data/data/procdata.dvc

yields: Error: Failed to reproduce 'main_data/data/procdata.dvc': Failed to reproduce 'main_data/data/procdata.dvc': missing dependencies: main_data/data/file1.txt, main_data/data/file2.txt

I suspect this is due to the change of directory (the -c option)?

This script fixes the issue (adapted from https://gist.github.com/mroutis/d6778c7582b5ec75f3cf339ce90e7a36):

#!/bin/bash

# Create repository
rm -rf myrepo
mkdir myrepo
cd myrepo

# Init Git and DVC
git init
dvc init
git commit -s -m"Init"

# Create data directory (with "processed" data and "raw" data)
mkdir -p main_data/data/procdata
mkdir -p main_data/data/rawdata

# Create initial rawdata
#
#   echo 1 > main_data/data/rawdata/file1.txt
#   echo 2 > main_data/data/rawdata/file2.txt
#
for d in {1..2}; do
    echo $d > main_data/data/rawdata/file$d.txt
done

# Create two `dvc` files (file1.txt.dvc and file2.txt.dvc); shown for file1:
#
# - Change directory to `main_data/data/procdata`
# - Dependency: `main_data/data/rawdata/file1.txt`
# - Output:     `main_data/data/procdata/file1.txt`
# - Command:    `cp main_data/data/rawdata/file1.txt main_data/data/procdata`
for d in {1..2}; do
    dvc run \
      --no-exec \
      -c main_data/data/procdata \
      -d ../rawdata/file$d.txt \
      -o file$d.txt \
      cp ../rawdata/file$d.txt .

    DEPS="$DEPS -d procdata/file$d.txt"
done

echo "Dependency flags: $DEPS"
# Create a `dvc` file (procdata.dvc)
#
# - Change directory to `main_data/data`
# - Dependencies: `main_data/data/procdata/file1.txt`
# - Dependencies: `main_data/data/procdata/file2.txt`
# - Output:       `main_data/data/procdata`
#
# The output of this command is the directory that contains its own
# dependencies, which creates a circular dependency
#
# dvc run \
#   --no-exec \
#   -c main_data/data \
#   $DEPS \
#   -o procdata

# dvc repro main_data/data/procdata.dvc

# Below is the correct form
dvc run \
  --no-exec \
  -c main_data/data \
  $DEPS \
  -f data.dvc

dvc repro main_data/data/data.dvc
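
The path part of the fix is that -d paths are resolved from the directory given with -c, so each dependency has to be written relative to it (procdata/file1.txt rather than file1.txt). A minimal sketch of that conversion in plain bash follows. The helper names rel_to_cwd and dep_flags are made up for illustration, and the prefix stripping only handles paths that actually sit under the given directory.

```shell
#!/bin/bash
# Sketch only: express repository-relative paths relative to the directory
# passed to `dvc run -c`, since -d paths are resolved from there.

# rel_to_cwd CWD PATH -> prints PATH with the leading "CWD/" removed.
rel_to_cwd() {
    local cwd="${1%/}" path="$2"
    printf '%s\n' "${path#"$cwd"/}"
}

# dep_flags CWD PATH... -> prints one "-d <relative path>" per PATH.
dep_flags() {
    local cwd="$1" p flags=()
    shift
    for p in "$@"; do
        flags+=(-d "$(rel_to_cwd "$cwd" "$p")")
    done
    printf '%s\n' "${flags[*]}"
}
```

For example, dep_flags main_data/data main_data/data/procdata/file1.txt main_data/data/procdata/file2.txt prints -d procdata/file1.txt -d procdata/file2.txt, which matches the DEPS value built in the script above.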