I have a complex pipeline pattern and am looking for any DVC templates/cookbook-like information to jumpstart my project.
I’m running monthly training cycles for over 1K models (one per county, across several states) from a Jupyter notebook. For each monthly run, I first adjust the main code cell with that month's specifics. I then create multiple copies of the notebook and open each in its own kernel so they run in parallel, segmenting the input file of 1K targets into groups using Python slice syntax, e.g., [128:428]. Each notebook compiles the source input and model metrics information for its counties.
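For illustration, the manual segmentation amounts to something like this (a minimal sketch; the target names and chunk size are made up):

```python
# Split the full target list into contiguous slices, one per notebook/kernel.
# The real input is a file of ~1K county targets; these values are illustrative.
targets = [f"county_{i:04d}" for i in range(1000)]

chunk_size = 300
chunks = [targets[i:i + chunk_size] for i in range(0, len(targets), chunk_size)]
# Each notebook copy then works on one slice, e.g. targets[128:428].
```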
Each notebook's main loop cycles through its portion of the 1K model targets, passing the county-specific settings to the following pattern: each model training cycle pulls data from multiple S3 sources, transforms it into a single dataframe used for training (train, test, and validation sets), and then generates multiple prediction sets that are pushed back to S3.
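In code, the per-county cycle looks roughly like this (a sketch only; every function name here is a placeholder standing in for my actual S3 pulls, transforms, and training code):

```python
import pandas as pd

# Placeholder stand-ins for the real S3 reads, transform, and training code.
def load_sources(county):
    # Real version reads several S3 objects for this county.
    return [pd.DataFrame({"x": [1.0]}), pd.DataFrame({"y": [2.0]})]

def build_frame(frames):
    # Real version merges the sources into one training dataframe.
    return pd.concat(frames, axis=1)

def train_and_predict(df, settings):
    # Real version splits train/test/validate, fits a model, pushes
    # prediction sets to S3, and returns evaluation metrics.
    return {"county": settings["county"], "rmse": 0.0}

def run_chunk(counties, settings_by_county):
    # The main loop each notebook runs over its slice of the targets.
    results = []
    for county in counties:
        frames = load_sources(county)
        df = build_frame(frames)
        results.append(train_and_predict(df, settings_by_county[county]))
    return results
```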
Due to county-to-county variation, my feature-engineering workflow needs the same shape: when I add a new feature, I measure its effect across multiple counties by aggregating several model performance metrics.
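The aggregation step I want to replicate looks roughly like this (metric names, feature-set labels, and values are all illustrative):

```python
import pandas as pd

# Each training cycle emits a metrics record per county; aggregate across
# counties to see whether a new feature helped overall.
per_county_metrics = [
    {"county": "county_0001", "feature_set": "baseline",      "rmse": 4.2, "mae": 3.1},
    {"county": "county_0001", "feature_set": "with_new_feat", "rmse": 3.9, "mae": 2.8},
    {"county": "county_0002", "feature_set": "baseline",      "rmse": 5.0, "mae": 3.7},
    {"county": "county_0002", "feature_set": "with_new_feat", "rmse": 4.6, "mae": 3.3},
]

metrics = pd.DataFrame(per_county_metrics)
# Mean of each metric per feature set, across counties.
summary = metrics.groupby("feature_set")[["rmse", "mae"]].mean()
```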
Has anyone used or seen a similar pattern I might be able to leverage? I need to replicate the looping and aggregation of data for the full 1K run.