I am trying to create an end-to-end pipeline (I am new to this…) and to use DVC with it.
I have created 5 scripts corresponding to the phases:
- 1st Feature engineering
- Train-test data split
- 2nd Feature engineering
- Model training
I can do 'dvc run bla bla bla' for each script, and I can reproduce the results.
However, if I want to re-run the WHOLE pipeline after I have tweaked something, is there any best practice, recommendation, or guideline? For example:
- wrap all the 'dvc run' commands in a shell script and run that?
- use one very bulky command, dvc run -d xx.data -d yy.data … -d fe1.py -d fe2.py -d main.py …, to create a single dvc file?
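To make the wrapper idea concrete, here is a rough sketch in Python rather than shell. It only builds and prints one 'dvc run' command per phase and does not execute anything. Apart from xx.data, fe1.py, fe2.py and main.py, every file name (split.py, fe1.data, train.data, model.pkl, and so on) is a made-up placeholder, and only the -d and -o flags are used; some DVC versions may need further flags.

```python
import shlex

# Hypothetical stage table: (dependencies, outputs, command).
# File names are placeholders for illustration, not real project files.
STAGES = [
    (["xx.data", "fe1.py"], ["fe1.data"], "python fe1.py"),
    (["fe1.data", "split.py"], ["train.data", "test.data"], "python split.py"),
    (["train.data", "test.data", "fe2.py"],
     ["train_fe2.data", "test_fe2.data"], "python fe2.py"),
    (["train_fe2.data", "main.py"], ["model.pkl"], "python main.py"),
]


def build_dvc_run(deps, outs, command):
    """Return the argument list for one 'dvc run' call."""
    args = ["dvc", "run"]
    for dep in deps:
        args += ["-d", dep]   # each dependency gets its own -d
    for out in outs:
        args += ["-o", out]   # each output gets its own -o
    return args + shlex.split(command)


def build_pipeline(stages):
    """Build one command per stage, in pipeline order."""
    return [build_dvc_run(deps, outs, cmd) for deps, outs, cmd in stages]


if __name__ == "__main__":
    # Print only; replacing print with subprocess.run(cmd, check=True)
    # would actually run the commands (assumes dvc is installed).
    for cmd in build_pipeline(STAGES):
        print(" ".join(shlex.quote(part) for part in cmd))
```

The point of the sketch is that the outputs of each phase are listed as dependencies of the next one, so the phases form a chain instead of one big command.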
Also, is creating a separate script for each phase a good way to manage the pipeline?
Looking forward to your answers! Many thanks!