Question on using dvc for end-to-end pipeline


#1

Hi,

I am trying to create an end-to-end pipeline (I am new to this) and would like to use dvc for it.

I have created 5 scripts, one for each phase:

  1. Preprocessing
  2. 1st feature engineering
  3. Train-test data split
  4. 2nd feature engineering
  5. Model training

I can do 'dvc run bla bla bla' for each script, and I can reproduce the results.

However, if I would like to re-run the “WHOLE” pipeline after I have tweaked something, is there any best practice, recommendation or guideline? For example, should I wrap all the ‘dvc run’ commands in a shell script and run that? Or should I use dvc run -d xx.data -d yy.data … -d fe1.py -d fe2.py -d main.py …, i.e. one very bulky command that produces a single dvc file?

Also, is creating a separate script for each phase a good way to manage the pipeline?

Looking forward to your answer! Many thanks!

Best Regards,
Solomon Leung


#2

Hi @solomonleung !

So your pipeline consists of only 5 stages (i.e. 5 dvc runs), right? Running dvc repro 5.dvc will look through all of the stages it depends on (1.dvc … 4.dvc) and reproduce anything that has changed, effectively re-running the whole pipeline. You can also force a re-run by adding the --force flag to the dvc repro command. There is also a -P|--all-pipelines option for dvc repro that reproduces all pipelines present in your project. Please feel free to correct me if I didn’t get your scenario right :)

Yes, that is the recommended way of organizing your pipeline: each dvc run has its own script that corresponds to one action (e.g. preprocess, train, etc.) :)
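
To make the dvc repro 5.dvc behaviour described above more concrete, here is a toy Python sketch of the idea. It is not DVC's actual implementation. It models the five stages as a chain and works out which ones would need to re-run when one of them has changed. The PIPELINE dict and the stages_to_rerun function are hypothetical names, and the sketch simplifies by assuming that a re-run stage always produces new outputs.

```python
# Toy model of a 5-stage pipeline: each stage lists the stages it depends on.
# Hypothetical illustration only; this is not how DVC is implemented.
PIPELINE = {
    "1.dvc": [],
    "2.dvc": ["1.dvc"],
    "3.dvc": ["2.dvc"],
    "4.dvc": ["3.dvc"],
    "5.dvc": ["4.dvc"],
}


def stages_to_rerun(target, changed, pipeline=PIPELINE):
    """Return, in execution order, the stages that must re-run for `target`.

    `changed` is the set of stages whose own dependencies (scripts, data)
    were modified. A stage re-runs if it changed or if an upstream stage
    re-ran (simplifying assumption: a re-run always changes its outputs).
    """
    order = []   # stages upstream of `target`, dependencies first
    seen = set()

    def visit(stage):
        if stage in seen:
            return
        seen.add(stage)
        for dep in pipeline[stage]:
            visit(dep)
        order.append(stage)

    visit(target)

    rerun = []
    for stage in order:
        upstream_rerun = any(dep in rerun for dep in pipeline[stage])
        if stage in changed or upstream_rerun:
            rerun.append(stage)
    return rerun


if __name__ == "__main__":
    # Only the third stage's script was edited.
    print(stages_to_rerun("5.dvc", changed={"3.dvc"}))
```

In this toy model, editing only the third stage leaves stages 1 and 2 untouched and re-runs stages 3 to 5, which is the kind of saving you get from asking for the last stage instead of wrapping every dvc run in a shell script.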

Thanks,
Ruslan


#3

Hi @kupruser

Thanks :)
Suppose I have created “config.py” and “util.py” scripts, and every phase of my pipeline references them.

  1. Do I need to declare a dependency on config.py and util.py for each dvc run?
  2. If I make changes to util.py (e.g. add a new class which is only used in the final phase), would the whole pipeline re-run?

Looking forward to your reply :)

Thanks,
Solomon


#4

Hi @solomonleung !

  1. Yes, if every stage of your pipeline is using it. But you could also declare it once in one of the first stages and use -p|--pipeline with dvc repro to re-run the whole pipeline. It is clearly a trade-off though.
  2. Yes, it would. Dvc only compares checksums of the dependencies, so if anything in a dependency has changed, it will re-run every stage that depends on it.
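
The checksum comparison in point 2 can be sketched in a few lines of Python. This is only an illustration of the idea, not DVC's code. The helper names file_md5 and changed_deps are made up for this example, and the recorded dict stands in for whatever checksums were saved after the last run.

```python
import hashlib
import os
import tempfile


def file_md5(path):
    """Return the MD5 hex digest of a file's contents."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_deps(deps, recorded):
    """List the dependencies whose current checksum differs from the recorded one."""
    return [path for path in deps if file_md5(path) != recorded.get(path)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        util = os.path.join(workdir, "util.py")
        with open(util, "wb") as handle:
            handle.write(b"def helper():\n    return 1\n")

        # Hypothetical record of checksums saved after the last run.
        recorded = {util: file_md5(util)}
        print(changed_deps([util], recorded))  # nothing has changed yet

        # Add a new class that only the final phase would use.
        with open(util, "ab") as handle:
            handle.write(b"\nclass Extra:\n    pass\n")
        print(changed_deps([util], recorded))  # util.py is now reported
```

The checksum covers the whole file, so it cannot tell which stage actually uses the new class. Any stage that lists util.py as a dependency is treated as changed.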

Thanks,
Ruslan


#5

Hi @kupruser,

Thanks again :)
I have some more questions:

  1. If I accidentally run “dvc run” in the wrong place (e.g. /data), the .dvc file will be created in the directory where I ran “dvc run”. What would happen if I move the .dvc file to another location (e.g. /)? Will it break the pipeline?

  2. My pipeline is still evolving, and the outputs (-o) and dependencies (-d) of some stages will change (e.g. change from .csv to .pkl, or add a new output to some stages). When that happens, can the .dvc file be modified, or do I need to re-create the whole pipeline? :( Or is there a way to remove some stages from the pipeline?

Regards,
Solomon


#6

Hi @solomonleung

  1. As long as outputs of those stages don’t overlap, you’ll just have two similar stages in your pipeline.
  2. Sure, you can just open any .dvc file with your favorite editor and add/remove/modify parameters in it (command, outputs, dependencies, etc.). Every .dvc file is in a simple YAML format, so it is pretty easy to modify. See https://dvc.org/doc/user-guide/dvc-file-format. You can also remove any stage with dvc remove --purge stage.dvc. See https://dvc.org/doc/commands-reference/remove.
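
As a rough illustration of what “outputs don’t overlap” in point 1 means, the Python sketch below checks whether two stages claim the same output path. It is a toy check, not part of DVC. The function name overlapping_outputs, the stage file names and the output paths are all hypothetical.

```python
from collections import defaultdict


def overlapping_outputs(stages):
    """Map each output path claimed by more than one stage to those stages.

    `stages` maps a stage file name to the list of output paths it declares.
    """
    owners = defaultdict(list)
    for stage_name, outputs in stages.items():
        for path in outputs:
            owners[path].append(stage_name)
    return {path: names for path, names in owners.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical example: the same training stage was created twice,
    # once by accident inside data/ and once at the project root.
    stages = {
        "data/train.dvc": ["data/model.pkl"],
        "train.dvc": ["data/model.pkl"],
        "split.dvc": ["data/train.csv", "data/test.csv"],
    }
    print(overlapping_outputs(stages))
```

Here both training stages write data/model.pkl, so they overlap and one of them should be removed or edited. If the accidental stage wrote to a different path, the two would simply coexist as similar stages.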

Thanks,
Ruslan


#7

Thanks Ruslan :)