The examples I reviewed in the documentation describe how to define, share, and reproduce a model training pipeline. Once we are happy with our trained model and want to move it into production, what is the recommended way to use DVC to ensure pipeline consistency between training and prediction?
I would like to reuse the DVC pipeline defined for training (feature engineering, processing, …) to ensure consistency and proper usage. On the other hand, the pipeline would also be somewhat different (each individual model script would “predict” instead of “train”).
Is there a recommended solution? Should I create additional variables to tell each script whether to train or predict? And what should be done with the metric at the end?
Hi @jcrousse!
Thank you for your patience! We don’t have a recommended approach for that yet, but your idea of using additional variables should work (I imagine it would be an environment variable, e.g. `PREDICT=1 dvc repro`, right?). That would work nicely if switching your pipeline from training to prediction is just a matter of flipping a switch. Otherwise it might get tricky, and you would have to build a separate pipeline or figure out a way to make the current one work with the help of some additional flags. Are you mostly talking about using this in a “production” setting? Or is it something that you would want to keep in your project?
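To make the env var idea a bit more concrete, here is a minimal sketch of what a single stage script could look like. It is only an illustration, not an official DVC pattern: the `PREDICT` variable, the `model.json` path, the toy one-weight model, and the function names `is_predict_mode`, `featurize`, and `run_stage` are all assumptions made up for this example. The point is that the feature engineering step is shared, and only the final step branches on the switch.

```python
import json
import os
from pathlib import Path

# Hypothetical location of the trained model artifact (an assumption for this sketch).
MODEL_PATH = Path("model.json")


def is_predict_mode(environ=None):
    """Return True when the (hypothetical) PREDICT switch is set to "1"."""
    env = os.environ if environ is None else environ
    return env.get("PREDICT", "0") == "1"


def featurize(rows):
    """Shared feature engineering, identical in both modes (toy: double each value)."""
    return [2.0 * x for x in rows]


def run_stage(rows, targets=None, environ=None, model_path=MODEL_PATH):
    """Train and save a toy model, or load it and predict, depending on PREDICT."""
    features = featurize(rows)

    if is_predict_mode(environ):
        # Predict mode: reuse the saved weight; no metric is produced here.
        weight = json.loads(Path(model_path).read_text())["weight"]
        return {"mode": "predict", "predictions": [weight * f for f in features]}

    # Train mode: fit a single weight by least squares through the origin.
    weight = sum(f * y for f, y in zip(features, targets)) / sum(f * f for f in features)
    Path(model_path).write_text(json.dumps({"weight": weight}))

    # The metric (mean squared error on the training data) only exists in train mode.
    mse = sum((weight * f - y) ** 2 for f, y in zip(features, targets)) / len(targets)
    return {"mode": "train", "metric": mse}
```

In this sketch the metric is simply skipped in predict mode, which is one possible answer to the metric question. One caveat, as far as I know: DVC does not track environment variables as stage dependencies, so after changing `PREDICT` the stages may still look up to date and a forced run (`dvc repro --force`) may be needed.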
Thanks for the answer!
Yes, the use case is exactly to use the pipeline in production.
Once the model DAG becomes a bit complex, we would like to have only one DAG definition (the DVC one) and not reproduce a similar DAG outside of DVC.
Otherwise there is too much of a risk of introducing mistakes or inconsistencies between the two.