Batch Pipeline Support and Multi I/O

#1

This tool looks very promising. However, I am wondering about the functionality of the pipeline feature. From the documentation it seems that I can chain an input file to an output file, and each logged operation will be appended to the repro argument when I commit. In other words, it is not clear to me how to create a batch process without first defining each step separately.

It is also not clear to me how to handle multiple input and output files.

I am very curious if this would be possible.

Here is an example of an offline analysis that I am performing on human EEG data and behavioral data (using Python and Java):

sample size × "ASCII_" + n + ".txt" --> Pandas --> SPSS Syntax --> Tables + Figures

    --> First Level Summary: Individual Participants (Diagnostics)
   /
--<
   \
    --> Second Level Summary: Group Level

    + --> Behavioral CSV


EEG Raw Binary Data --> EEG Manual Preprocessing --> MATLAB Scripts --> Figures

    + --> EEG CSV


 Behavioral CSV --
                  \
                   >--> Correlations (SPSS Syntax)
                  /
        EEG CSV --

#2

Hi @daniellabbe!

From the documentation it seems that I can chain an input file to an output file and each logged operation will be appended to the repro argument when I commit.

Just to be clear, dvc doesn’t commit anything for you. Here is a simple example consisting of a pipeline with one input file (foo), two stages (adding foo: foo.dvc, and running a cp command: foo_copy.dvc), and one output file (foo_copy), which should illustrate the repro feature.

$ mkdir myrepo
$ cd myrepo
$ git init
Initialized empty Git repository in /storage/git/dvc/myrepo/.git/
$ dvc init
$ echo foo > foo
$ dvc add foo
$ cat foo.dvc
md5: e84a017c53974d403499c6b0c7749292
outs:
- cache: true
  md5: d3b07384d113edec49eaa6238ad5ff00
  metric: false
  path: foo
$ dvc run -d foo -o foo_copy cp foo foo_copy
Using 'foo_copy.dvc' as a stage file
Running command:
        cp foo foo_copy
$ cat foo_copy.dvc
cmd: cp foo foo_copy
deps:
- md5: d3b07384d113edec49eaa6238ad5ff00
  path: foo
md5: 3ab438ecc14092f29e582ef619987063
outs:
- cache: true
  md5: d3b07384d113edec49eaa6238ad5ff00
  metric: false
  path: foo_copy
$ dvc status
$ echo bar > foo
$ dvc status
foo_copy.dvc
        deps
                changed:  foo
foo.dvc
        outs
                changed:  foo
$ dvc repro foo_copy.dvc
Reproducing 'foo.dvc'
Verifying data sources in 'foo.dvc'
Reproducing 'foo_copy.dvc'
Running command:
        cp foo foo_copy
$ cat foo_copy
bar
$ cat foo.dvc
md5: 753ff1b2c89ba8f51058313870c02ef1
outs:
- cache: true
  md5: c157a79031e1c40f85931829bc5fc552
  metric: false
  path: foo
$ cat foo_copy.dvc
cmd: cp foo foo_copy
deps:
- md5: c157a79031e1c40f85931829bc5fc552
  path: foo
md5: c74166ccdffb1d2e598d879c57961df9
outs:
- cache: true
  md5: c157a79031e1c40f85931829bc5fc552
  metric: false
  path: foo_copy
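Since dvc doesn’t commit anything itself, the stage files above are versioned with plain git once you are happy with them. A minimal sketch continuing the session (the commit message is arbitrary):

```shell
# Version the stage files with git; dvc should already have added the
# cached outputs (foo, foo_copy) to .gitignore, so git tracks only metadata.
git add foo.dvc foo_copy.dvc .gitignore
git commit -m "Add foo and the copy stage"
```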

how to create a batch process without first defining each step separately.

It is also not clear to me how to handle multiple input and output files.

You need to describe each step of your pipeline to dvc once; there is no way around it. There are two ways to create a step in your pipeline (notice that the examples use multiple dependencies and outputs):

  1. Using CLI tools like dvc add, dvc run, etc. E.g.:
$ dvc run -f mystage.dvc -d mydependency1 -d mydependency2 -o myoutput1 -o myoutput2 ./mycmd
  2. Manually writing dvcfiles (a simple YAML format) for your steps. E.g.:
$ cat mystage.dvc
cmd: ./mycmd
outs:
 - path: myoutput1
 - path: myoutput2
deps:
 - path: mydependency1
 - path: mydependency2
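To avoid typing each per-participant step by hand, you can also generate the stages in a shell loop. A hypothetical sketch for the ASCII_n.txt files from your diagram (analyze.py and the summary file names are placeholders for your actual script and outputs):

```shell
# Create one stage per participant file; each stage gets its own
# dvcfile, dependency, and output (all names are placeholders).
for n in 1 2 3; do
  dvc run -f "participant_${n}.dvc" \
          -d "ASCII_${n}.txt" -d analyze.py \
          -o "summary_${n}.csv" \
          python analyze.py "ASCII_${n}.txt" "summary_${n}.csv"
done
```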

The dependency scheme that you’ve shown is indeed possible to implement with dvc 🙂
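For example, the fan-in step at the end of your diagram (Correlations depending on both the Behavioral CSV and the EEG CSV) could be described with a hand-written dvcfile like this, in the same format as above (the file and script names are placeholders for yours):

```yaml
cmd: ./run_correlations.sh
deps:
 - path: behavioral.csv
 - path: eeg.csv
outs:
 - path: correlations_output.txt
```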

Thanks,
Ruslan


#3

Thank you Ruslan for your swift reply!
I’m looking forward to trying this out!
