I am trying to learn and understand DVC. From a beginner’s perspective, it seems to me that the command `dvc run` is a bit overloaded with meaning and functionality (and its name does not clearly show what it really does). It also seems to overlap in functionality with `dvc add`.
This command currently seems like a small monolith that does more than one thing, so I would propose to decouple and simplify it a bit. This may be done in several steps.
- Do not do any data tracking with `dvc run`; do it only with `dvc add`. For example, instead of running:

      dvc run -d data.xml -o results -f stage1 <command>

  we should run:

      dvc add data.xml
      dvc add results
      dvc run -d data.xml.dvc -o results.dvc -f stage1 <command>

  The result is that instead of having just the file `stage1.dvc`, which tracks both dependencies and outputs (as well as storing the command), we would have the files `data.xml.dvc` and `results.dvc` (which track the dependency and output data files) as well as the file `stage1.dvc`, which stores the command and refers to the `.dvc` files of the dependency and the output. In other words, the file `stage1.dvc` will not contain the checksums of the data files and will look like this:

      md5: 4b8cbf1b4b282cd87db9d34a46daf3da
      cmd: <command>
      wdir: .
      deps:
      - data.xml.dvc
      outs:
      - results.dvc

  Also, the `dvc run` command will modify the output files (in this case `results.dvc`) by adding a field that shows that this data file is generated by `stage1`. It can be like this:

      stage: stage1.dvc
- Rename the command `dvc run` to `dvc stage create` (or `dvc stage add`). Its function is only to create the file `stage1.dvc` (without executing the command). The name of the stage should be mandatory, and this command can be called like this:

      dvc stage create stage1.dvc \
          -d data.xml.dvc \
          -o results.dvc \
          -cmd <command>
- To actually execute a stage, we can run:

      dvc stage run stage1.dvc

  This will of course execute all the previous stages first, according to the dependencies (checking whether the tracked data needs to be refreshed).
- The extension of stage files and data-tracking files does not have to be the same; we can use the extension `.stage` for stage files (for example `stage1.stage`). Or maybe we can use the same extension (`.dvc`) for both types of files, but have a field in them that shows what type they are (like `type: stage` or `type: data`). It seems to me that it is better if they have different file extensions.
- Maybe we can rename the command `dvc add` to `dvc data add`, for consistency, but I am not sure about this.
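To make the "execute all the previous stages" part concrete, here is a rough Python sketch of how the proposed `dvc stage run` could work out the execution order. It is only an illustration under assumptions: the `STAGES` dictionary stands in for the stage files on disk, and `stage2.dvc`, `report.dvc` and the function `execution_order` are made-up names, not anything that exists in DVC.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical in-memory view of the proposed stage files: each stage lists
# the .dvc files it depends on and the .dvc files it produces.
STAGES = {
    "stage1.dvc": {"deps": ["data.xml.dvc"], "outs": ["results.dvc"]},
    "stage2.dvc": {"deps": ["results.dvc"], "outs": ["report.dvc"]},
}


def execution_order(stages, target):
    """Return the stages needed for `target`, upstream stages first."""
    # Map every output back to the stage that generates it (the proposed
    # "stage:" field in the output's .dvc file would play this role).
    producer = {out: name for name, s in stages.items() for out in s["outs"]}

    graph = {}  # stage name -> set of stages that must run before it
    pending = [target]
    while pending:
        name = pending.pop()
        if name in graph:
            continue
        # Dependencies without a producer (plain data files) add no edge.
        upstream = {producer[d] for d in stages[name]["deps"] if d in producer}
        graph[name] = upstream
        pending.extend(upstream)
    return list(TopologicalSorter(graph).static_order())


order = execution_order(STAGES, "stage2.dvc")
print(order)
```

Asking for `stage2.dvc` lists `stage1.dvc` first, because `results.dvc` is produced by it. A real implementation would also skip stages whose data has not changed.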
One of the benefits of this restructuring (besides being easier to understand and use) is that the format of stage files would become very simple, so they can be created and edited easily with any other tool (for example a bash script), or even manually (with an editor).
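As an illustration of how little tooling the simplified format would need, here is a small Python sketch that writes such a stage definition without any YAML library. The function name `render_stage` is hypothetical, and the `md5` field is left out on the assumption that DVC itself would compute and add it.

```python
def render_stage(cmd, deps, outs, wdir="."):
    """Render a stage definition in the simplified format proposed above."""
    lines = [f"cmd: {cmd}", f"wdir: {wdir}", "deps:"]
    lines += [f"- {d}" for d in deps]
    lines.append("outs:")
    lines += [f"- {o}" for o in outs]
    return "\n".join(lines) + "\n"


# The stage from the example: one dependency, one output.
text = render_stage("<command>", ["data.xml.dvc"], ["results.dvc"])
print(text)
```

Redirecting this text into `stage1.dvc` would be all that a script needs to do to define a stage.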
Another benefit of making stage definitions simpler is that it becomes possible to put all the stage definitions in a single pipeline file. This pipeline file may look similar to a Makefile (and plays a similar role). For example it may look like this:
    results1.dvc: data1.xml.dvc data2.xml.dvc
        cmd: <command>
        wdir: .

    results2.dvc: results1.dvc
        cmd: <command>
        wdir: .
This file may be called `Dvcfile` by default, but may also have a different name. One of the benefits of having the whole pipeline defined in a single file is that it is easy to create alternative pipelines that have small modifications from the first one (I think that some users have asked for something like this in the past).
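A file like the one sketched above would also be easy to read by other tools. Here is a minimal Python parser for it, just to show the idea. It assumes that only the fields `cmd` and `wdir` exist and that no output file is literally named `cmd` or `wdir`; the function name `parse_pipeline` is made up for this example.

```python
EXAMPLE = """\
results1.dvc: data1.xml.dvc data2.xml.dvc
    cmd: <command>
    wdir: .

results2.dvc: results1.dvc
    cmd: <command>
    wdir: .
"""

FIELDS = ("cmd", "wdir")  # assumed field names, taken from the example


def parse_pipeline(text):
    """Parse the Makefile-like sketch into {output: {"deps": [...], ...}}."""
    stages = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Split on the first colon only, so a command may contain colons.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key in FIELDS:
            if current is None:
                raise ValueError(f"field outside of a rule: {line!r}")
            stages[current][key] = value
        else:
            # A rule line: "<output>: <dep> <dep> ..."
            current = key
            stages[current] = {"deps": value.split()}
    return stages


stages = parse_pipeline(EXAMPLE)
print(stages["results1.dvc"]["deps"])
```

Creating an alternative pipeline would then amount to copying the file and changing a rule or two.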
A benefit of having the pipeline syntax similar to that of a `Makefile` is that many people will be familiar with it and will understand it immediately.
Maybe the `Makefile` itself can be used for the pipeline, if we could customize the way that it checks for outdated files (to look at the checksums as well, besides the timestamps), but I don’t think this can be done easily.
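To show what I mean by checking checksums instead of timestamps, here is a small Python sketch. The helper names `file_md5` and `is_outdated` are hypothetical; the recorded checksum is assumed to come from the corresponding `.dvc` file.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_outdated(path, recorded_md5):
    """True if the file is missing or its content no longer matches."""
    return not os.path.exists(path) or file_md5(path) != recorded_md5


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.xml")
    with open(path, "w") as f:
        f.write("<data/>")
    recorded = file_md5(path)

    # Changing only the timestamp does not make the file outdated...
    os.utime(path, (0, 0))
    after_touch = is_outdated(path, recorded)

    # ...but changing the content does.
    with open(path, "w") as f:
        f.write("<data>new</data>")
    after_edit = is_outdated(path, recorded)

print(after_touch, after_edit)
```

`make` would treat the touched file as changed because it only compares modification times, which is the behaviour that would need to be customized.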