I am trying to learn and understand DVC. From a beginner’s perspective, it seems to me that the command dvc run is a bit overloaded with meaning and functionality (and its name does not clearly show what it really does). It also seems to have some kind of overlapping (in functionality) with
This command currently seems a bit like a mini-monolithic command that does more than one thing. So, I would propose to decouple and simplify it a bit. This may be done in several steps.
Do not do any data tracking with dvc run; do it only with dvc add. For example, instead of running:
dvc run -d data.xml -o results -f stage1 <command>
we should run:
dvc add data.xml
dvc add results
dvc run -d data.xml.dvc -o results.dvc -f stage1 <command>
The result of this is that instead of having just the file stage1.dvc that tracks both dependencies and outputs (as well as storing the command), we would have the files data.xml.dvc and results.dvc (that track the dependency and output data files) as well as the file stage1.dvc, which stores the command and refers to the .dvc files of the dependency and output. In other words, the file stage1.dvc will not contain the checksums of the data files and will look like this:
md5: 4b8cbf1b4b282cd87db9d34a46daf3da
cmd: <command>
wdir: .
deps:
- data.xml.dvc
outs:
- results.dvc
The dvc run command will modify the output files (in this case results.dvc) by adding to them a field showing that this data file is generated by stage1. It can be like this:
Rename the command to dvc stage create (or dvc stage add). Its function is only to create the file stage1.dvc (without executing the command). The name of the stage should be mandatory, and this command can be called like this:
dvc stage create stage1.dvc \
    -d data.xml.dvc \
    -o results.dvc \
    -cmd <command>
To actually execute a stage, we can run:
dvc stage run stage1.dvc
This will of course first execute all the previous stages, according to the dependencies (checking whether the tracked data need to be refreshed).
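To make the "execute the previous stages first" idea concrete, here is a minimal Python sketch of how the execution order could be derived from the stage dependencies. It is only an illustration under my own assumptions: the stage names, the stage_deps mapping and the execution_order function are hypothetical and are not part of DVC. It uses the standard graphlib module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each stage maps to the stages it depends on.
# The stage names are illustrative and not taken from a real project.
stage_deps = {
    "stage1.dvc": set(),
    "stage2.dvc": {"stage1.dvc"},
    "stage3.dvc": {"stage1.dvc", "stage2.dvc"},
}


def execution_order(target, deps):
    """Return the target's upstream stages followed by the target itself."""
    # Collect the target and everything it transitively depends on.
    needed, stack = set(), [target]
    while stack:
        stage = stack.pop()
        if stage not in needed:
            needed.add(stage)
            stack.extend(deps.get(stage, ()))
    # Sort only the needed stages so that dependencies come first.
    subgraph = {s: deps.get(s, set()) & needed for s in needed}
    return list(TopologicalSorter(subgraph).static_order())


print(execution_order("stage2.dvc", stage_deps))
```

Running stage2.dvc here would only involve stage1.dvc and stage2.dvc; stage3.dvc is left out because nothing requested depends on it.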
The extension of stage files and data-tracking files does not have to be the same; we can use the extension .stage for stage files (for example stage1.stage). Or maybe we can use the same extension (.dvc) for both types of files, but have a field in them that shows what type they are (like type: data). It seems to me that it is better if they have different file extensions.
Maybe we can also rename the data-tracking command to dvc data add, for consistency, but I am not sure about this.
One of the benefits of this restructuring (besides being easier to understand and use) is that the format of stage files would become very simple, so that they can be created and edited easily with any other tool (for example a bash script), or even manually (with an editor).
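As a rough illustration of how little is needed to generate such a stage file from a script, here is a minimal Python sketch. The function name render_stage and the command python train.py are hypothetical, and I am assuming that the md5 field is a checksum of the rest of the file, which the proposal itself does not specify.

```python
import hashlib


def render_stage(cmd, deps, outs, wdir="."):
    """Render a stage definition in the simplified layout proposed above."""
    lines = [f"cmd: {cmd}", f"wdir: {wdir}", "deps:"]
    lines += [f"- {dep}" for dep in deps]
    lines.append("outs:")
    lines += [f"- {out}" for out in outs]
    body = "\n".join(lines) + "\n"
    # Assumption: the md5 field is a checksum of the rest of the stage file.
    digest = hashlib.md5(body.encode("utf-8")).hexdigest()
    return f"md5: {digest}\n{body}"


# Hypothetical command; the dependency and output names follow the example above.
text = render_stage("python train.py", ["data.xml.dvc"], ["results.dvc"])
print(text)
```

This uses plain string formatting rather than a YAML library, so a command containing special YAML characters would need quoting.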
Another benefit of making stage definitions simpler is that it becomes possible to put all the stage definitions in a single pipeline file. This pipeline file may look similar to a Makefile (and plays a similar role). For example it may look like this:
results1.dvc: data1.xml.dvc data2.xml.dvc
    cmd: <command>
    wdir: .

results2.dvc: results1.dvc
    cmd: <command>
    wdir: .
This file may be called Dvcfile by default, but may also have a different name. One of the benefits of having the whole pipeline defined in a single file is that it is easy to create alternative pipelines that have small modifications from the first one (I think that some users have asked for something like this in the past).
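To show that such a single pipeline file would be easy to process, here is a minimal Python sketch of a parser for it. The exact syntax is my assumption (an unindented "output: dependencies" line followed by indented "key: value" fields), and the function name parse_pipeline and the commands ./prepare.sh and ./train.sh are hypothetical.

```python
def parse_pipeline(text):
    """Parse a Makefile-like pipeline description into a dict of stages."""
    stages = {}
    current = None
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines between stages
        if raw[0].isspace():
            # Indented line: a "key: value" field of the current stage.
            if current is None:
                raise ValueError(f"field outside of a stage: {raw!r}")
            key, _, value = raw.strip().partition(":")
            current[key.strip()] = value.strip()
        else:
            # Unindented line: "output: dep1 dep2 ..." starts a new stage.
            target, _, deps = raw.partition(":")
            current = {"deps": deps.split()}
            stages[target.strip()] = current
    return stages


# Hypothetical pipeline file content, following the layout sketched above.
example = """\
results1.dvc: data1.xml.dvc data2.xml.dvc
    cmd: ./prepare.sh
    wdir: .

results2.dvc: results1.dvc
    cmd: ./train.sh
    wdir: .
"""

pipeline = parse_pipeline(example)
print(pipeline["results2.dvc"])
```

An alternative pipeline could then be a copy of this file with one or two lines changed.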
A benefit of having the pipeline syntax similar to that of a Makefile is that many people will be familiar with it and will understand it immediately.
A Makefile itself could be used for the pipeline if we could customize the way that make checks for outdated files (to look at the checksums as well as the timestamps), but I don’t think this can be done easily.
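For comparison, this is roughly what a checksum-based "is it outdated?" check looks like, as opposed to the timestamp comparison that make does. It is a minimal Python sketch under my own assumptions: the helper names file_md5 and is_outdated are hypothetical, and the recorded checksum stands in for whatever a tracking file would store.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_outdated(path, recorded_md5):
    """A file is outdated if it is missing or its content checksum changed."""
    return not os.path.exists(path) or file_md5(path) != recorded_md5


# Demonstration on a temporary file (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    data = os.path.join(tmp, "data.xml")
    with open(data, "w") as fobj:
        fobj.write("<data/>")
    recorded = file_md5(data)

    os.utime(data, (0, 0))  # the timestamp changes, the content does not
    unchanged = is_outdated(data, recorded)

    with open(data, "w") as fobj:
        fobj.write("<data>new</data>")
    changed = is_outdated(data, recorded)

print(unchanged, changed)
```

A pure timestamp check would treat the file as changed after the os.utime call, while the checksum check only reacts to the actual content change.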