Simplifying `dvc run` and pipelines

I am trying to learn and understand DVC. From a beginner’s perspective, the command dvc run seems overloaded with meaning and functionality, and its name does not make clear what it really does. Its functionality also seems to overlap with that of dvc add.

This command currently seems a bit like a mini-monolithic command that does more than one thing. So, I would propose to decouple and simplify it a bit. This may be done in several steps.

  1. Do not do any data tracking with dvc run, do it only with dvc add. For example, instead of running:

    dvc run -d data.xml -o results -f stage1 <command>
    

    we should run:

    dvc add data.xml
    dvc add results
    dvc run -d data.xml.dvc -o results.dvc -f stage1 <command>
    

    The result is that instead of a single file stage1.dvc that tracks both dependencies and outputs (as well as storing the command), we would have the files data.xml.dvc and results.dvc (which track the dependency and output data files) plus the file stage1.dvc, which stores the command and refers to the .dvc files of the dependency and the output. In other words, the file stage1.dvc will not contain the checksums of the data files and will look like this:

    md5: 4b8cbf1b4b282cd87db9d34a46daf3da
    cmd: <command>
    wdir: .
    deps:
    - data.xml.dvc
    outs:
    - results.dvc
    

    Also, the dvc run command will modify the output files (in this case results.dvc) by adding to them a field that shows that this data file is generated by stage1. It can be like this: stage: stage1.dvc

  2. Rename the command dvc run to dvc stage create (or dvc stage add). Its function is only to create the file stage1.dvc (without executing the command). The name of the stage should be mandatory, and this command can be called like this:

    dvc stage create stage1.dvc \
        -d data.xml.dvc \
        -o results.dvc \
        -cmd <command>
    
  3. To actually execute a stage, we can run:

    dvc stage run stage1.dvc
    

    This of course will execute all the previous stages, according to dependencies (and checking the tracked data if they need to be refreshed).

  4. Stage files and data-tracking files do not have to share the same extension; we can use the extension .stage for stage files (for example stage1.stage). Alternatively, we can keep the same extension (.dvc) for both types of files, but add a field that shows what type they are (like type: stage or type: data). It seems to me that different file extensions are the better option.

  5. Maybe we can rename the command dvc add to dvc data add, for consistency, but I am not sure about this.

One of the benefits of this restructuring (besides being easier to understand and use) is that the format of stage files would become very simple, so that they can be created and edited easily with any other tool (for example a bash script), or even manually (with an editor).
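To illustrate how little tooling such a file would need, here is a minimal Python sketch that writes a stage file in the simplified format from step 1. The function name `render_stage` and the command `python train.py` are made up for the example, the md5 field is left out, and none of this is an existing DVC API.

```python
def render_stage(cmd, deps, outs, wdir="."):
    """Return the text of a stage file in the simplified format proposed above.

    Hypothetical helper: deps and outs are lists of .dvc file names,
    not data files, so no checksums appear in the result.
    """
    lines = [f"cmd: {cmd}", f"wdir: {wdir}", "deps:"]
    lines += [f"- {dep}" for dep in deps]
    lines.append("outs:")
    lines += [f"- {out}" for out in outs]
    return "\n".join(lines) + "\n"


# "python train.py" is a placeholder for <command>.
stage_text = render_stage("python train.py", ["data.xml.dvc"], ["results.dvc"])
print(stage_text)
```

Writing `stage_text` to stage1.dvc would be all that a stage-creation command has to do under this proposal.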

Another benefit of making stage definitions simpler is that it becomes possible to put all the stage definitions in a single pipeline file. This pipeline file may look similar to a Makefile (and plays a similar role). For example it may look like this:

    results1.dvc: data1.xml.dvc data2.xml.dvc
        cmd: <command>
        wdir: .

    results2.dvc: results1.dvc
        cmd: <command>
        wdir: .

This file may be called Dvcfile by default, but may also have a different name. One of the benefits of having the whole pipeline defined in a single file is that it is easy to create alternative pipelines that have small modifications from the first one (I think that some users have asked for something like this in the past).
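As a rough idea of how simple reading such a file could be, here is a sketch of a parser for the Makefile-like layout shown above. The function `parse_pipeline` and the commands `python step1.py` / `python step2.py` are assumptions for illustration only; the format itself is just my proposal, not something DVC supports.

```python
def parse_pipeline(text):
    """Parse a Makefile-like pipeline into {output: {"deps": [...], "cmd": ..., "wdir": ...}}."""
    stages = {}
    current = None
    for raw in text.splitlines():
        if not raw.strip():
            continue  # blank lines only separate stages
        if not raw[0].isspace():
            # Rule header: "<output>: <dep> <dep> ..."
            target, _, deps = raw.partition(":")
            current = {"deps": deps.split()}
            stages[target.strip()] = current
        else:
            # Indented field: "key: value"
            if current is None:
                raise ValueError(f"field outside of a stage: {raw!r}")
            key, _, value = raw.strip().partition(":")
            current[key.strip()] = value.strip()
    return stages


# Placeholder commands stand in for <command>.
EXAMPLE = """\
results1.dvc: data1.xml.dvc data2.xml.dvc
    cmd: python step1.py
    wdir: .

results2.dvc: results1.dvc
    cmd: python step2.py
    wdir: .
"""

pipeline = parse_pipeline(EXAMPLE)
print(pipeline["results1.dvc"]["deps"])
```

An alternative pipeline would then just be a second file with a few lines changed.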

A benefit of having the pipeline syntax similar to that of Makefile is that many people will be familiar with it and will understand it immediately.

Maybe the Makefile itself can be used for the pipeline, if we could customize the way that it checks for outdated files (to look at the checksums as well, besides the timestamps), but I don’t think this can be done easily.
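What I mean by checking checksums instead of timestamps can be sketched in a few lines of Python. The helper names `md5_of` and `is_outdated` are made up for this example, and I am assuming the recorded checksum is a plain md5 hex digest of the file contents.

```python
import hashlib
from pathlib import Path


def md5_of(path):
    """Return the md5 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_outdated(path, recorded_md5):
    """A file is outdated if it is missing or its content no longer matches.

    Unlike make, the modification time is never consulted, so touching
    a file without changing its content does not trigger a rebuild.
    """
    p = Path(path)
    return not p.exists() or md5_of(p) != recorded_md5
```

A stage would need to be re-run whenever `is_outdated` returns True for any of its dependencies.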

makepp: http://makepp.sourceforge.net/ (mentioned in this comment: https://github.com/iterative/dvc/issues/1018#issuecomment-521140083) may be a good replacement for make, since it uses md5 checksums for tracking inputs and outputs. The drawback is that it is written in Perl.

Another good solution might be doit, which is Python based and extensible:

I don’t see the benefit of adding this complexity.

I do like the line of thought of this suggestion. dvc run does too many things.

This creates a conflict with dvc repro though, which kind of serves the same purpose.

I do agree there’s some confusion between simple DVC-file vs. stage files. There’s a few discussions about these terms on the dvc.org repo in fact… But having 2 different extensions also sounds like adding unnecessary complexity for such a small difference. Not sure, maybe you’re right here.

Not sure the references between stage files would be easy to manage manually.

DVC discovers the pipeline(s) by examining all the stages and rebuilding any DAGs on-the-fly, from what I understand, so no need for a pipeline file.
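A rough sketch of that idea, as I understand it: match each stage's dependencies against the outputs of the other stages and sort the result topologically. This is not DVC's actual code; the function `execution_order` and the stage dictionaries are made up for illustration, and it relies on `graphlib` from the Python 3.9+ standard library.

```python
from graphlib import TopologicalSorter


def execution_order(stages):
    """Order stages so that every stage runs after the stages producing its deps.

    stages maps a stage name to {"deps": [...], "outs": [...]}.
    """
    # Which stage produces each output?
    producer = {out: name for name, stage in stages.items() for out in stage["outs"]}
    # For each stage, the set of stages it depends on (source data has no producer).
    graph = {
        name: {producer[dep] for dep in stage["deps"] if dep in producer}
        for name, stage in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


# Stages listed out of order on purpose; no pipeline file is needed.
stages = {
    "stage2.dvc": {"deps": ["results1"], "outs": ["results2"]},
    "stage1.dvc": {"deps": ["data.xml"], "outs": ["results1"]},
}
order = execution_order(stages)
print(order)
```

The order comes out of the individual stage definitions alone, which is the point being made here.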

But dvc repro stage1.dvc can be “syntactic sugar” for dvc stage run --repro stage1.dvc. This way there is no conflict between them; both can be valid and useful.

Anyway, I am not able to think of all the consequences of such a UI redesign. Maybe it creates more problems than it solves. But maybe they are small issues that can be worked out.