I am trying to create an end-to-end pipeline (I am new to it …), and I am trying to use dvc as well.
I have created 5 scripts, one for each phase:
- 1st Feature engineering
- train-test data split
- 2nd Feature engineering
- Model training
I could do ‘dvc run bla bla bla’ for each program file, and I could reproduce the result.
However, if I would like to re-run the “WHOLE” pipeline after I have tweaked something, is there any “best practice”, recommendation, or guideline? For example, should I use a shell script to wrap up all the ‘dvc run’ commands and run it? Or use dvc run -d xx.data -d yy.data … -d fe1.py -d fe2.py -d main.py …, i.e. one very bulky command, to make a single dvc file?
Also, is creating a separate script for each phase a good idea for managing the pipeline?
Looking forward to your answer! Many thanks!
Hi @solomonleung !
So your pipeline consists of only 5 stages (i.e. 5 dvc runs), right?
dvc repro 5.dvc would look through all of the dependency stages (1.dvc … 4.dvc) and reproduce anything that has changed, effectively re-running the whole pipeline. You can also force re-running by adding the --force flag to the dvc repro command. We also have a -P|--all-pipelines option for dvc repro that will reproduce all pipelines present in your project. Please feel free to correct me if I didn’t get your scenario right.
Yes, it is a recommended way of organizing your pipeline: each dvc run has its own script that corresponds to one action (e.g. preprocess, train, etc.).
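As a rough illustration of the one-script-per-stage layout, the Python sketch below only builds the command line for each stage and prints it; it does not execute anything. The `-d` and `-o` flags are the ones mentioned in this thread, while the helper `dvc_run_command`, the `STAGES` table, the script name `split.py`, and all data file names are made up for the example.

```python
import shlex


def dvc_run_command(script, deps, outs):
    """Build the argv for one `dvc run` call; nothing is executed here."""
    argv = ["dvc", "run", "-d", script]  # the script itself is a dependency
    for dep in deps:
        argv += ["-d", dep]
    for out in outs:
        argv += ["-o", out]
    argv += ["python", script]
    return argv


# Hypothetical stage layout: (script, data dependencies, outputs).
STAGES = [
    ("fe1.py", ["raw.data"], ["fe1.data"]),
    ("split.py", ["fe1.data"], ["train.data", "test.data"]),
    ("fe2.py", ["train.data", "test.data"], ["train_fe2.data", "test_fe2.data"]),
    ("main.py", ["train_fe2.data"], ["model.data"]),
]

for script, deps, outs in STAGES:
    command = dvc_run_command(script, deps, outs)
    print(" ".join(shlex.quote(arg) for arg in command))
```

Because each stage's outputs are the next stage's dependencies, the stages form a chain that can be reproduced from its last stage.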
Suppose I have created “config.py” and “util.py” scripts, and every phase of my pipeline references them.
- Do I need to declare a dependency on config.py and util.py for each dvc run?
- If I make changes to util.py (e.g. add a new class which is used in the final phase …), would the whole pipeline re-run?
Looking forward to your reply
Hi @kupruser,
Thanks again :)
And I have some more questions …
If I accidentally run “dvc run” in the wrong place (e.g. /data), the .dvc file will be created at the path where I ran “dvc run”. What would happen if I move the dvc file to another location (e.g. /)? Will it break the pipeline?
My pipeline is still improving, and the outputs (-o) and dependencies (-d) of some stages will change (e.g. change from .csv to .pk1, or add a new output to some stages). When that happens, can the dvc file be modified? Or do I need to re-create the whole pipeline? :( Or is there a way to remove some stages from the pipeline?