Right architecture for daily training

Hello,
I’m trying to understand the DVC way to build a system with daily training.
In my current setup, I have a few environments (development, staging, etc.). Every night, a scheduled job runs the training and deploys the newly trained model. I want to use DVC to track the models and the data used to train them (and later I want to track changes between models, but that’s not the main point yet).
So here are the main questions:

  1. Am I supposed to make a daily Git commit from the scheduled job so that each model and its data are kept?

  2. Is it preferable to have a separate repository to track all the DVC files (to avoid polluting the main Git repository)? Maybe a Git repository for each environment?

  3. Finally, is DVC a good tool for this kind of system? I really like using DVC when models are trained directly by data scientists. However, I’m a bit confused about how to properly construct a system with continuous training. Everything I can imagine looks too complex. Am I missing some documentation or examples?

Thanks for your help

Let’s start with your last question: is DVC a good fit? For simple retraining you already have something like a cron job, and DVC may be overkill there. But since you are also looking to keep track of each of those retraining runs across multiple environments, DVC could be a good fit.

You may want to take a look at our model registry in What is a Model Registry. It uses our collaboration hub Studio, but that is mostly a convenience to help you manage and visualize the state of the models. All the underlying functionality can be done with DVC and our other open-source tool GTO, which uses Git tags to manage which model version is associated with each environment.
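
To make the Git tag part concrete, here is a minimal sketch of what driving GTO from a script could look like. The model name `nightly-model` is made up, and the exact CLI arguments are worth double-checking against `gto --help`; the point is only that registering a version and assigning a stage are plain Git tag operations:

```python
import subprocess

def run(*args, capture=False):
    """Run a CLI command, raising if it fails; optionally return its stdout."""
    result = subprocess.run(args, check=True, capture_output=capture, text=True)
    return result.stdout if capture else None

# Register the currently checked-out commit as a new version of a model.
# ("nightly-model" is a made-up name; check `gto register --help` for options.)
run("gto", "register", "nightly-model")

# Assign that version to an environment stage (argument order is from memory;
# verify with `gto assign --help`).
run("gto", "assign", "nightly-model", "dev")

# Both operations are recorded purely as annotated Git tags (roughly
# "nightly-model@vX.Y.Z" and "nightly-model#dev#N"), so the registry state
# lives in the repo itself and can be pushed like any other tag.
print(run("git", "tag", "--list", "nightly-model*", capture=True))
```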

Without knowing the details of your workflow, I imagine it could look something like:

  1. Create a branch or fork of your development repo to track the automated training, or an entirely separate repo depending on how connected they are.
  2. Each night, run the training and commit the changes.
  3. Use GTO to register a new model version (steps 2 and 3 are sketched below).
  4. During the deployment job for each environment, use GTO to update the model version in each environment upon success.
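
For steps 2 and 3, the nightly job could look roughly like the sketch below. The file paths, model name, and training entry point are all placeholders; the assumption is that you track the dataset and the trained model with `dvc add` and keep only the small `.dvc` pointer files in Git:

```python
import subprocess
from datetime import date

def run(*args):
    subprocess.run(args, check=True)

# Placeholder paths -- adjust to your project layout.
DATA_PATH = "data/train.csv"
MODEL_PATH = "model.pkl"

# Train however you normally do (the entry point name is hypothetical).
run("python", "train.py")

# Track tonight's dataset and the freshly trained model with DVC, and push
# the actual blobs to remote storage.
run("dvc", "add", DATA_PATH, MODEL_PATH)
run("dvc", "push")

# Commit only the small .dvc pointer files to Git.
run("git", "add", f"{DATA_PATH}.dvc", f"{MODEL_PATH}.dvc")
run("git", "commit", "-m", f"Nightly training {date.today().isoformat()}")

# Register the commit as a new model version and push commits plus tags.
run("gto", "register", "nightly-model")
run("git", "push", "--follow-tags")
```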

So once retraining is successful, your job might first assign the new model to the dev stage. Then you may have a job that uses GTO to find the latest dev model and deploy it to staging. Once that’s successful, it assigns the model to staging, etc.
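
A rough sketch of that promotion step, under the same assumptions (names are placeholders, and the deployment itself is just a stub you would replace with your own logic):

```python
import subprocess

def run(*args):
    subprocess.run(args, check=True)

MODEL = "nightly-model"  # placeholder name

def deploy(model: str, environment: str) -> None:
    # Stub: pull the artifact and ship it to the target environment however
    # you normally deploy (the model path here is just an example).
    run("dvc", "pull", "model.pkl")
    # ... your deployment logic here ...

def promote(model: str, environment: str) -> None:
    """Deploy first; only if that succeeds, record the stage assignment
    as a GTO Git tag so the registry reflects what is actually running."""
    deploy(model, environment)
    run("gto", "assign", model, environment)
    run("git", "push", "--tags")

# The retraining job promotes to dev; a later job repeats this for staging, etc.
promote(MODEL, "dev")
```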


@dberenbaum , thanks for your answer, and I’m sorry for the late reply. Okay, I see the main idea of the approach. About “During the deployment job for each environment, use GTO to update the model version in each environment upon success”: in my case I use different datasets depending on the environment, which means that I run one training per environment, so there is a model per environment per day. Would you advise me to have something like a branch per environment in this case?

I want to use DVC only to track input data and artifacts.

Thanks for the details @odv! A branch per environment sounds reasonable to me.
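
With a branch per environment, the nightly job for each environment could check out its own branch before running the steps above, so each environment’s datasets and models stay on their own line of history. Branch names, dataset paths, and the per-environment model names below are only examples:

```python
import subprocess

def run(*args):
    subprocess.run(args, check=True)

# One long-lived branch per environment (branch names are examples).
ENV_BRANCHES = {"dev": "train-dev", "staging": "train-staging"}

def nightly_run(env: str) -> None:
    # Switch to the environment's branch, then run the same add/push/commit
    # steps as in the earlier sketch against that environment's dataset.
    run("git", "checkout", ENV_BRANCHES[env])
    run("dvc", "pull")                          # fetch this environment's data
    # ... train, `dvc add`, `dvc push`, `git commit` as before ...
    run("gto", "register", f"model-{env}")      # e.g. one model name per environment
    run("git", "push", "--follow-tags")

for env in ENV_BRANCHES:
    nightly_run(env)
```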

@dberenbaum , Okay, I think I’m starting to understand the idea.
Could you also give me a few hints about using forks?
Right now I’m using only tags on the main branch to create releases, so this workflow definitely won’t match the branch setup. That’s why I’m curious about the forks you mentioned earlier. Do you update the fork on each “code release” and keep only model changes there, so that you get something like a separation between code and data?

I don’t have a specific workflow in mind, but hearing what you want to do, I’m not immediately seeing much difference between branches and forks. In either case, you could keep the separation by only making code changes on one branch/fork and then merging them downstream to the other branches/forks where you have different datasets. If you want further separation, you could package the code in one repo and then import it in the repos where you actually do the training.
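
If you go the “package the code” route, each training repo’s nightly job could install the shared code as a package and keep only the data and model tracking local. The package name, URL, version pin, and module entry point below are placeholders, not anything that exists:

```python
import subprocess

def run(*args):
    subprocess.run(args, check=True)

# Install the shared training code from its own repository
# (the URL, package name, and version pin are placeholders).
run("pip", "install", "git+https://example.com/your-org/training-code.git@v1.2.0")

# The training repo itself then only holds environment-specific config plus
# the DVC-tracked data and model artifacts, e.g.:
run("dvc", "pull", "data/train.csv")        # this environment's dataset (path is an example)
run("python", "-m", "training_code.train")  # hypothetical module entry point
run("dvc", "add", "model.pkl")
run("dvc", "push")
```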

Thanks for your help!