Coding Patterns

I have a complex pipeline pattern and am looking for any DVC templates/cookbook-like information to jumpstart my project.

I’m running monthly training cycles for over 1K models, one per county within several states, using a Jupyter notebook. For monthly runs, I first adjust the main code cell with the specifics for the month. I then create multiple copies of the notebook and open each in its own kernel to run multiple threads, segmenting the input file of 1K targets into groups using Python slice syntax, e.g., [128:428]. The notebook compiles source input and model metrics information for each county.
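A minimal sketch of that segmentation step (every name below is an illustrative stand-in, not my actual code) might look like:

```python
# Hypothetical sketch: split the 1K county targets into contiguous slices,
# one per notebook/kernel. Names are made up for illustration.
targets = [f"county_{i:04d}" for i in range(1000)]  # stand-in for the input file

def make_slices(n_items, n_workers):
    """Split range(n_items) into contiguous (start, stop) slice bounds."""
    size, rem = divmod(n_items, n_workers)
    bounds = []
    start = 0
    for w in range(n_workers):
        stop = start + size + (1 if w < rem else 0)  # spread the remainder
        bounds.append((start, stop))
        start = stop
    return bounds

slices = make_slices(len(targets), 4)
# each worker notebook then processes targets[start:stop] for its pair
```

Each of the four kernels would take one `(start, stop)` pair and run `targets[start:stop]`, which is equivalent to hand-editing `[128:428]`-style slices per notebook.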

Each notebook’s main loop cycles through its portion of the 1K model targets, passing the county-specific settings to the following pattern: each model training cycle takes data in from multiple S3 sources and transforms it into a single dataframe that is then used for training (train, test, and validate sets). Multiple prediction sets are then generated and pushed to S3.

Due to county variations, my ML feature-engineering dev pattern needs to be similar: I measure the effect of adding new features across multiple counties, aggregating multiple model performance evaluation metrics.

Has anyone used or seen a similar pattern I might be able to leverage? I need to replicate the looping and aggregation of data for the full 1K run.


This is a template project you may find useful:

We have several interesting articles listed as well; some of them may give you ideas.

As for your pipeline, from what I understand it consists of a single training + metrics stage, run with different hyperparameters each month and repeated 1K times over different data that is kept up to date on S3. I’m probably oversimplifying, but I can only give you a broad idea of what I would recommend in this situation (other people may have better suggestions):

  1. Version your hyperparameters in a parameters file with Git in a DVC repo. See
  2. Put the training source code in the repo as well, and make it into a stage with dvc run — I’d actually consider splitting it into 2 stages: training and evaluation (metrics). Make sure to register the correct dependencies (see 3. below), params, outputs, and metrics for the stage(s).
  3. As part of your stage definition (2. above), register the external dependencies held directly in the S3 bucket, so you don’t have to move them from there.
  4. Each month, change the parameters file and use dvc repro to run a training cycle.
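As a rough sketch of how steps 1–3 could come together (every file name, parameter key, and S3 path here is made up for illustration, not from your setup), the stage registration might look like:

```shell
# Hypothetical `dvc run` invocation; all names/paths are illustrative.
# -n: stage name
# -p: hyperparameters tracked in the params file (step 1)
# -d: dependencies, including an external dependency kept on S3 (step 3)
# -o / -M: cached model output and metrics file (step 2)
dvc run -n train \
        -p train.max_depth,train.eta \
        -d train.py \
        -d s3://my-bucket/raw/county_data.csv \
        -o models/model.pkl \
        -M metrics/scores.json \
        python train.py
```

After that, editing the params file and running `dvc repro` (step 4) re-executes only what changed.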

The only part I didn’t cover, and which sounds quite tricky, is the data segmentation and multi-threading. One option would be to have 1K tags in the Git repo, each with a slightly different stage definition that only processes its part of the data, but that seems like overkill to set up.
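For completeness: later DVC releases added a templating/`foreach` construct in `dvc.yaml` that generates one stage instance per list entry, which could replace the 1K-tags idea. A rough sketch (county names, script, and paths are all made up for illustration):

```yaml
# Hypothetical dvc.yaml using the foreach construct from newer DVC versions.
stages:
  train:
    foreach:            # one stage instance is generated per entry
      - county_0001
      - county_0002
    do:
      cmd: python train.py --county ${item}
      deps:
        - train.py
      params:
        - train
      outs:
        - models/${item}.pkl
```

The county list could also be loaded from the params file rather than written inline.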

I hope this at least gives you an idea and encourages you to jump into using DVC. I recommend you reach us directly at with more specific questions on how to achieve each goal in your process, one by one, once you’re using DVC. We’re very responsive over there.


@jays thanks for the question. The pipeline looks a bit complex :sweat_smile: Before answering I have a couple of questions:

  1. Re the new feature - are you adding a new feature for all the counties or for a subset of these?
  2. How do you evaluate whether a new feature is good? Just by comparing to the previous results (county by county), or do you have some general/single metric?
  3. Have you tried training a single ML model for all the counties? My feeling is that it’s very easy to overfit with 1K models.

PS: thanks @jorgeorpinel for the answer. The params suggestion might be very helpful.

@dmitry @jorgeorpinel Thank you both for the great responses.

  1. Currently, new features are applied to all models. I will be creating market segment models, e.g., tuning distinct feature sets/parameters for condos, single family homes, and other market segments within counties, which is part of my interest in DVC so I can build, run & track multiple derivatives without creating sets of overlapping notebooks.

  2. Evaluation is with AUC, log loss, and F1. My dev process is to first test a new feature on a set of 9 diverse counties and see how it impacts each county as well as the aggregated scores for the 9 counties. I then scale up to 128 counties and check aggregated scores. I then stand up a parallel run alongside the main 1K-county production process. I’ve architected the pipeline (my 2K-line notebook :slight_smile:) into functions that are essentially stages in the pipeline that can be turned on and off. I do the 1K production run with all stages turned on and then do the parallel run(s) with the merge and transform stages turned off, so that the parallel runs use the transform output as a starting point and execute an experimental transform function stage where the new features are built. This also enables me to have multiple model stages (both alternate and sequential, for ensembles). If all goes well, I then promote the experimental features to the production transform stage.

  3. My SMEs (the CEO and CIO) believe that due to county by county differences in the source data, unique county models are mandatory. Here again, I want to leverage DVC to test state level models. Small scale tests along these lines have yielded mixed results. My primary model is XGBoost and I have experimental models using Logistic Regression and AutoML that I want to weave into an ensemble. DNNs are on the list, but not feasible for several months.
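The metric aggregation I described in 2. above can be sketched roughly like this (all function names, county names, and numbers are illustrative, not my real code or scores):

```python
# Hypothetical sketch of aggregating per-county evaluation metrics
# (AUC, log loss, F1) into summary scores for a feature experiment.
from statistics import mean

def aggregate_metrics(per_county):
    """per_county: {county: {"auc": ..., "log_loss": ..., "f1": ...}}
    Returns the mean of each metric across all counties."""
    keys = ("auc", "log_loss", "f1")
    return {k: mean(m[k] for m in per_county.values()) for k in keys}

scores = aggregate_metrics({
    "county_a": {"auc": 0.91, "log_loss": 0.32, "f1": 0.84},
    "county_b": {"auc": 0.88, "log_loss": 0.35, "f1": 0.80},
})
# scores holds the mean AUC, log loss, and F1 over the two counties
```

Running this first over the 9-county set, then the 128-county set, gives comparable aggregate numbers at each scale.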

Params are certainly an element I need to implement. I was considering implementing the pyenv library before I discovered DVC.

Thank you for the pointers to the docs. I’ll review them and start piping questions to

Thanks again for the great responses!
