Using DVC for end-to-end pipeline

solomonleung · December 25, 2018, 2:31pm

Hi,

I am trying to create an end-to-end pipeline (I am new to this…), and I am trying to use DVC for it as well.

I have created 5 scripts, one for each phase:

Preprocessing
1st feature engineering
Train-test data split
2nd feature engineering
Model training

I can run 'dvc run bla bla bla' for each script, and I can reproduce the result.

However, if I want to re-run the whole pipeline after I have tweaked something, is there any best practice, recommendation, or guideline? For example, should I wrap all the 'dvc run' commands in a shell script and run that? Or should I use dvc run -d xx.data -d yy.data … -d fe1.py -d fe2.py -d main.py …, i.e. one very bulky command that produces a single .dvc file?

Also, is creating a separate script for each phase a good way to manage the pipeline?

Looking forward to your answer! Many thanks!

Best Regards,
Solomon Leung

kupruser · December 25, 2018, 3:02pm

Hi @solomonleung !

So your pipeline consists of only 5 stages (i.e. 5 dvc runs), right? dvc repro 5.dvc looks through all of the dependency stages (1.dvc … 4.dvc) and reproduces anything that has changed, effectively re-running the whole pipeline. You can also force a re-run by adding the --force flag to the dvc repro command. We also have the -P|--all-pipelines option for dvc repro, which reproduces all pipelines present in your project. Please feel free to correct me if I didn't get your scenario right.
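To make the traversal concrete, here is a toy Python model of what reproducing the last stage does. It is only a sketch, not DVC's implementation: the Stage class, the inputs_changed flag, and the repro function are hypothetical names, and it assumes that re-running a stage always changes its outputs, so everything downstream of it re-runs as well.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Stage:
    name: str
    # Stages whose outputs this stage consumes.
    upstream: List[str] = field(default_factory=list)
    # True if one of this stage's own scripts or data files was edited (assumed flag).
    inputs_changed: bool = False


def repro(target: str, stages: Dict[str, Stage]) -> List[str]:
    """Return the names of the stages that would be re-run, in execution order."""
    rerun: List[str] = []
    visited: Dict[str, bool] = {}  # stage name -> whether it was re-run

    def visit(name: str) -> bool:
        if name in visited:
            return visited[name]
        stage = stages[name]
        # Bring every upstream stage up to date before deciding about this one.
        upstream_rerun = [visit(dep) for dep in stage.upstream]
        needs_run = stage.inputs_changed or any(upstream_rerun)
        if needs_run:
            rerun.append(name)
        visited[name] = needs_run
        return needs_run

    visit(target)
    return rerun


# Five chained stages, named like the files in the reply above.
pipeline = {
    "1.dvc": Stage("1.dvc"),
    "2.dvc": Stage("2.dvc", upstream=["1.dvc"]),
    "3.dvc": Stage("3.dvc", upstream=["2.dvc"], inputs_changed=True),
    "4.dvc": Stage("4.dvc", upstream=["3.dvc"]),
    "5.dvc": Stage("5.dvc", upstream=["4.dvc"]),
}

print(repro("5.dvc", pipeline))  # ['3.dvc', '4.dvc', '5.dvc']
```

Only the edited stage and the stages after it are re-run here; stages 1 and 2 are left alone because nothing they depend on has changed.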

Yes, that is the recommended way of organizing your pipeline: each dvc run has its own script that corresponds to one action (e.g. preprocess, train, etc.).

Thanks,
Ruslan

solomonleung · January 1, 2019, 1:06pm

Hi @kupruser

Thanks!

Suppose I have created “config.py” and “util.py” scripts, and every phase of my pipeline references them.

Do I need to declare a dependency on config.py and util.py for each dvc run?
If I make changes to util.py (e.g. add a new class that is only used in the final phase…), would the whole pipeline re-run?

Looking forward to your reply

Thanks,
Solomon

kupruser · January 1, 2019, 1:47pm

Hi @solomonleung !

Yes, if every stage of your pipeline is using it. But you could also declare it once in one of the first stages and use -p|--pipeline with dvc repro to re-run the whole pipeline. It is clearly a trade-off, though.
Yes, it would. DVC only compares checksums of the dependencies, so if anything in a dependency has changed, it will re-run the stage that depends on it. If util.py is declared as a dependency of every stage, that means every stage.
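As a rough illustration of why an unrelated edit to util.py still triggers a re-run, here is a minimal Python sketch of checksum-based change detection. It is not DVC's code: md5_of and changed_deps are hypothetical helpers, and the file contents are made up for the example.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def md5_of(path: Path) -> str:
    """MD5 hex digest of the file's bytes."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def changed_deps(deps: List[Path], recorded: Dict[str, str]) -> List[str]:
    """Names of dependencies whose current checksum differs from the recorded one."""
    return [p.name for p in deps if recorded.get(p.name) != md5_of(p)]


with tempfile.TemporaryDirectory() as tmp:
    util = Path(tmp) / "util.py"
    config = Path(tmp) / "config.py"
    util.write_text("def helper():\n    return 1\n")
    config.write_text("SEED = 42\n")
    deps = [util, config]

    # Checksums as they would be stored when the stage last ran.
    recorded = {p.name: md5_of(p) for p in deps}
    unchanged = changed_deps(deps, recorded)

    # Adding a class changes the file's checksum, whether or not this stage uses the class.
    util.write_text(util.read_text() + "\nclass NewThing:\n    pass\n")
    after_edit = changed_deps(deps, recorded)

print(unchanged)   # []
print(after_edit)  # ['util.py']
```

The check looks at the file as a whole, not at which parts of it a stage actually imports, so any edit to a shared module counts as a change for every stage that lists it as a dependency.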

Thanks,
Ruslan

solomonleung · January 1, 2019, 5:28pm

Hi @kupruser,

Thanks again :)
And I have some more questions…

If I accidentally run “dvc run” in the wrong place (e.g. /data), the .dvc file is created in the directory where I ran “dvc run”. What would happen if I have the .dvc file in another location (e.g. /)? Will it break the pipeline?
My pipeline is still evolving, and the outputs (-o) and dependencies (-d) of some stages will change (e.g. a change from .csv to .pk1, or a new output added to some stage). When that happens, can the .dvc file be modified? Or do I need to re-create the whole pipeline? :( Or is there a way to remove some stages from the pipeline?

Regards,
Solomon

kupruser · January 1, 2019, 8:21pm

Hi @solomonleung

As long as the outputs of those stages don’t overlap, you’ll just have two similar stages in your pipeline.
Sure, you can just open any .dvc file with your favorite editor and add/remove/modify parameters in it (command, outputs, dependencies, etc.). Any .dvc file is in a simple YAML format, so it is pretty easy to modify. See https://dvc.org/doc/user-guide/dvc-file-format . You can also remove any stage with dvc remove --purge stage.dvc. See https://dvc.org/doc/commands-reference/remove .

Thanks,
Ruslan

solomonleung · January 5, 2019, 2:41am

Thanks, Ruslan!
