Using DVC to keep track of multiple model variants

I have to keep track of models that use the same feature engineering, the same features, and the same algorithm, but a different subset of the data. E.g. the raw data is the same, but it is filtered to a certain group before the model is trained.

My repo structure looks like this

root
  |- models
      |- model_group1
          |- model.xgb
          |- params.json
      |- model_group2
          |- ...
      |- ...
  |- metrics
      |- model_group1
          |- model_eval.json
      |- model_group2
          |- ...
      |- ...

AFAIK DVC assumes the pipeline produces one model, which makes total sense.
Making one model in which the group is a variant is not possible in my case.

Any suggestions? Do I need multiple DVC pipelines in the same repo?

@jcpsantiago how many models do you have? Is it ok to run separate commands for each of the models like:

$ dvc run -p model/group1/params.json:mysection1 -o model/group1/model.xgb python my_xgb.py
$ dvc run -p model/group2/params.json:mysection2 -o model/group2/model.pth python my_pytorch.py
...

We are thinking about introducing some generalization in dvc.yaml (might be something like loops). It would be great if you can show what solution would work the best for you. It could be a great starting point for the generalization. Please share your thoughts.

At the moment I have 3 models, but this will start growing fast.
More context: my current pipeline uses the drake package (I use R), which is also DAG-based like DVC, minus the storage backends and some other nice things. So far I’ve been keeping a models.yaml, which I use to track which models should exist and to store some parameters needed to deploy everything to AWS SageMaker. I started using DVC to fetch and track the (common) training data, plus the /models dir where the serialized models live. But now I saw this https://www.youtube.com/watch?v=xPncjKH6SPk and I want those nice reports too :smiley:

I think my ideal solution would be some sort of looping mechanism, because all these models share the same structure (filenames, file structure, etc.); the only differences are the file paths and a dplyr::filter() step in the pipeline. The command is almost the same too: Rscript run_pipeline.R <group_name>.
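To make the looping idea concrete, here is a minimal Python sketch (not a DVC feature) that generates one stage definition per group. The group names, the stage names (train_<group>), and the choice of run_pipeline.R as the only dependency are assumptions for illustration; the output and metrics paths follow the repo structure shown earlier.

```python
import json

# Hypothetical group names; the thread only says there are 3 models so far.
GROUPS = ["group1", "group2", "group3"]


def build_stages(groups):
    """Return a dvc.yaml-like mapping with one training stage per group."""
    stages = {}
    for group in groups:
        # Stage name "train_<group>" is an illustrative convention.
        stages[f"train_{group}"] = {
            "cmd": f"Rscript run_pipeline.R {group}",
            # Assumed dependency: only the pipeline script itself.
            "deps": ["run_pipeline.R"],
            "outs": [f"models/model_{group}/model.xgb"],
            "metrics": [f"metrics/model_{group}/model_eval.json"],
        }
    return {"stages": stages}


if __name__ == "__main__":
    # Printed as JSON purely for inspection; writing the mapping out as
    # dvc.yaml would need a YAML serializer of your choice.
    print(json.dumps(build_stages(GROUPS), indent=2))
```

Generating the stages from a list keeps the group names in one place, which is roughly what a built-in loop would give without an external script.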

Doing dvc repro with a lot of variants must be very costly though.

I’ll try adding separate commands for each, see what I don’t like about it, and report back.


Thank you for the detailed explanation!

I’d really appreciate it if you try and share your thoughts. It might give us good ideas before jumping to implementation :slight_smile:

@dmitry I tried it and I’m not a fan. There’s too much duplication in the YAML file, and I feel it mixes two things: the pipeline/DAG and the parameters needed to run it. These could be nicely decoupled.

Here’s what I would like to be able to do:

dvc_params:
  group1:
    name: "group 1"
    something_else: "foo"
  group2:
    name: "group 2"
    something_else: "meh"
  group3:
    name: "group 3"
    something_else: "gwah"

stages:
  get_some_data:
    cmd: ./get_data.sh {{dvc_params.name}}
    deps:
      - get_data.sh
      - query.sql
    outs:
      - data.csv
  train_model:
    cmd: Rscript train.R {{dvc_params.name}} {{dvc_params.something_else}}
    deps:
      - data.csv
    outs:
      - models/{{dvc_params.name}}/model.xgb
    metrics:
      - metrics/{{dvc_params.name}}/model_eval.json

DVC would loop over dvc_params and run the DAG, replacing the values in {{ }}.
Now one could ask whether you loop through the DAG stage by stage, or run the whole DAG for each element of dvc_params. I don’t have an opinion at the moment :smiley:
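As a rough sketch of what that substitution could look like, the Python below expands the {{dvc_params.<key>}} placeholders for each group and shows both loop orders. This is only an illustration of the proposal, not how DVC works; the function names and the shell-quoting of values (needed because "group 1" contains a space) are my own assumptions.

```python
import re
import shlex

# Values copied from the dvc_params example above.
DVC_PARAMS = {
    "group1": {"name": "group 1", "something_else": "foo"},
    "group2": {"name": "group 2", "something_else": "meh"},
    "group3": {"name": "group 3", "something_else": "gwah"},
}

# Only the cmd fields of the two stages, to keep the sketch short.
STAGE_CMDS = {
    "get_some_data": "./get_data.sh {{dvc_params.name}}",
    "train_model": "Rscript train.R {{dvc_params.name}} {{dvc_params.something_else}}",
}

PLACEHOLDER = re.compile(r"\{\{\s*dvc_params\.(\w+)\s*\}\}")


def render(template, params):
    """Replace each {{dvc_params.<key>}} with the shell-quoted value."""
    return PLACEHOLDER.sub(lambda m: shlex.quote(params[m.group(1)]), template)


def expand(order="per_group"):
    """Return (group, stage, command) triples in the chosen loop order."""
    if order == "per_group":
        # Whole DAG for one group, then the next group.
        pairs = [(g, s) for g in DVC_PARAMS for s in STAGE_CMDS]
    elif order == "per_stage":
        # One stage for every group, then the next stage.
        pairs = [(g, s) for s in STAGE_CMDS for g in DVC_PARAMS]
    else:
        raise ValueError(f"unknown order: {order}")
    return [(g, s, render(STAGE_CMDS[s], DVC_PARAMS[g])) for g, s in pairs]


if __name__ == "__main__":
    for group, stage, cmd in expand("per_group"):
        print(f"{group:7s} {stage:14s} {cmd}")
```

Both orders produce the same set of commands; they differ only in sequence, so the choice mostly matters for caching and for how early a failing group is noticed.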

I hope this illustrates my thoughts a bit better :smiley:


@jcpsantiago that’s super helpful! The scenario makes sense; we just need to work out the details: the loop syntax and so on.

There is an open issue with a great discussion around parameters: https://github.com/iterative/dvc/issues/3633
I’d appreciate it if you could jump in and bring your example to the discussion. It will add the loop/deduplication perspective to the variable problem, so we can come up with a holistic approach.

Done! I’m continuing the discussion in the github issue.

Thank you @jcpsantiago! Very much appreciated.

The major questions for now are what priority the issue has and when it will be implemented. I’ll talk with the team about this. All the design questions can be answered during implementation, in the same issue discussion.

If I can help further, e.g. by writing more user stories, please let me know.