This tool looks very promising. However, I am wondering about the functionality of the pipeline feature. From the documentation it seems that I can chain an input file to an output file, and each logged operation will be appended to the repro argument when I commit. What is not clear to me is how to create a batch process without first defining each step separately.
It is also not clear to me how to handle multiple input and output files.
I am very curious whether this would be possible.
Here is an example of an offline analysis that I am performing on human EEG data and behavioral data (using python and java):
```
sample size × "ASCII_" + n + ".txt" --> Pandas --> SPSS Syntax --> Tables + Figures
                                                                    --> First Level Summary: Individual Participants (Diagnostics)
                                                                   /
                                                                --<
                                                                   \
                                                                    --> Second Level Summary: Group Level
                                                                      + --> Behavioral CSV

EEG Raw Binary Data --> EEG Manual Preprocessing --> MATLAB Scripts --> Figures
                                                                      + --> EEG CSV

Behavioral CSV --
                \
                 >--> Correlations (SPSS Syntax)
                /
EEG CSV --
```
Hi @daniellabbe!
> From the documentation it seems that I can chain an input file to an output file and each logged operation will be appended to the repro argument when I commit.
Just to be clear, dvc doesn’t commit anything for you. Here is a simple example of a pipeline with one input file (foo), two stages (adding foo: foo.dvc, and running a cp command: foo_copy.dvc), and one output file (foo_copy) that should illustrate the repro feature.
```
$ mkdir myrepo
$ cd myrepo
$ git init
Initialized empty Git repository in /storage/git/dvc/myrepo/.git/
$ dvc init
$ echo foo > foo
$ dvc add foo
$ cat foo.dvc
md5: e84a017c53974d403499c6b0c7749292
outs:
- cache: true
  md5: d3b07384d113edec49eaa6238ad5ff00
  metric: false
  path: foo
$ dvc run -d foo -o foo_copy cp foo foo_copy
Using 'foo_copy.dvc' as a stage file
Running command:
        cp foo foo_copy
$ cat foo_copy.dvc
cmd: cp foo foo_copy
deps:
- md5: d3b07384d113edec49eaa6238ad5ff00
  path: foo
md5: 3ab438ecc14092f29e582ef619987063
outs:
- cache: true
  md5: d3b07384d113edec49eaa6238ad5ff00
  metric: false
  path: foo_copy
$ dvc status
$ echo bar > foo
$ dvc status
foo_copy.dvc
        deps
                changed: foo
foo.dvc
        outs
                changed: foo
$ dvc repro foo_copy.dvc
Reproducing 'foo.dvc'
Verifying data sources in 'foo.dvc'
Reproducing 'foo_copy.dvc'
Running command:
        cp foo foo_copy
$ cat foo_copy
bar
$ cat foo.dvc
md5: 753ff1b2c89ba8f51058313870c02ef1
outs:
- cache: true
  md5: c157a79031e1c40f85931829bc5fc552
  metric: false
  path: foo
$ cat foo_copy.dvc
cmd: cp foo foo_copy
deps:
- md5: c157a79031e1c40f85931829bc5fc552
  path: foo
md5: c74166ccdffb1d2e598d879c57961df9
outs:
- cache: true
  md5: c157a79031e1c40f85931829bc5fc552
  metric: false
  path: foo_copy
```
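Under the hood, `dvc status` detects staleness by comparing md5 checksums of dependencies and outputs against the values recorded in the dvcfiles. Here is a minimal sketch of that check (a simplification for illustration, not dvc's actual code):

```python
import hashlib

def file_md5(path):
    """Compute the md5 checksum of a file, like the ones recorded in dvcfiles."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def dep_changed(path, recorded_md5):
    """A stage is stale when a dependency's checksum no longer matches."""
    return file_md5(path) != recorded_md5
```

This mirrors the session above: `echo foo > foo` produces the checksum recorded in foo.dvc, and `echo bar > foo` changes it, which is why `dvc status` reports foo as changed and `dvc repro` re-runs the copy stage.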
> how to create a batch process without first defining each step separately.
> It is also not clear to me how to handle multiple input and output files.
You need to describe each step of your pipeline to dvc once; there is no way around it. There are two ways to create a step in your pipeline (notice that the examples use multiple dependencies and outputs):
- using CLI tools like `dvc add`, `dvc run`, etc. E.g.:
```
$ dvc run -f mystage.dvc -d mydependency1 -d mydependency2 -o myoutput1 -o myoutput2 ./mycmd
```
- manually writing dvcfiles (a simple YAML format) for your steps. E.g.:
```
$ cat mystage.dvc
cmd: ./mycmd
outs:
- path: myoutput1
- path: myoutput2
deps:
- path: mydependency1
- path: mydependency2
```
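Regarding the batch-process question: you can generate the per-participant `dvc run` invocations in a loop instead of typing each one by hand. A sketch of that idea — the file names (`ASCII_<n>.txt` from your diagram) are yours, while the stage names, output names, and `parse.py` script are hypothetical placeholders:

```python
# Sketch: emit one `dvc run` command per participant input file.
# parse_<n>.dvc, participant_<n>.csv, and parse.py are made-up names
# used only for illustration; substitute your real scripts and outputs.
def stage_commands(n_participants):
    cmds = []
    for n in range(1, n_participants + 1):
        cmds.append(
            f"dvc run -f parse_{n}.dvc"
            f" -d ASCII_{n}.txt"
            f" -o participant_{n}.csv"
            f" python parse.py ASCII_{n}.txt"
        )
    return cmds

for cmd in stage_commands(3):
    print(cmd)
```

Running the printed commands through a shell (or `subprocess.run` on each) defines all the stages once; after that, a single `dvc repro` on the final stage file re-runs only whatever is stale.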
The dependency scheme that you’ve shown is indeed possible to implement with dvc.
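For instance, the final Correlations step in your diagram, which joins the behavioral and EEG CSVs, would simply be one stage with two dependencies. A sketch of such a dvcfile (all file and script names here are hypothetical):

```
cmd: ./run_correlations.sh behavioral.csv eeg.csv
deps:
- path: behavioral.csv
- path: eeg.csv
outs:
- path: correlations.csv
```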
Thanks,
Ruslan
Thank you, Ruslan, for your swift reply!
I’m looking forward to trying this out!