Using DVC to keep track of multiple model variants

jcpsantiago · August 18, 2020, 2:18pm

I have to keep track of models that use the same feature engineering, the same features, and the same algorithm, but a different subset of the data, e.g. the raw data is the same, but it is then filtered to a certain group before a model is trained.

My repo structure looks like this

root
 |- models
 |   |- model_group1
 |   |   |- model.xgb
 |   |   |- params.json
 |   |- model_group2
 |   |   |- ...
 |   |- ...
 |- metrics
     |- model_group1
     |   |- model_eval.json
     |- model_group2
     |   |- ...
     |- ...

AFAIK DVC assumes the pipeline produces one model, which makes total sense.
Making one model in which the group is a variant is not possible in my case.

Any suggestions? Do I need multiple DVC pipelines in the same repo?

dmitry · August 18, 2020, 2:42pm

@jcpsantiago how many models do you have? Is it ok to run separate commands for each of the models like:

$ dvc run -p model/group1/params.json:mysection1 -o model/group1/model.xgb python my_xgb.py
$ dvc run -p model/group2/params.json:mysection2 -o model/group2/model.pth python my_pytorch.py
...
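When the number of groups grows, the per-model commands above could be generated rather than typed by hand. The following is only a sketch: the `train_<group>` stage names passed via `-n`, the `GROUPS` mapping, and the helper names are assumptions for illustration, and it reuses `my_xgb.py` for every group. It prints the commands for review and does not run them.

```python
import shlex
from typing import Dict, List

# Group name -> params section, mirroring the two commands above.
# Further groups would be added here; the names are illustrative.
GROUPS: Dict[str, str] = {
    "group1": "mysection1",
    "group2": "mysection2",
}


def dvc_run_argv(group: str, section: str, script: str = "my_xgb.py") -> List[str]:
    """Return the argument list for one per-group `dvc run` call."""
    return [
        "dvc", "run",
        "-n", f"train_{group}",  # stage name; this naming scheme is an assumption
        "-p", f"model/{group}/params.json:{section}",
        "-o", f"model/{group}/model.xgb",
        "python", script,
    ]


def all_commands(groups: Dict[str, str]) -> List[str]:
    """Render one shell-quoted command line per group."""
    return [shlex.join(dvc_run_argv(g, s)) for g, s in groups.items()]


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for line in all_commands(GROUPS):
        print(line)
```

Building an argument list first and quoting it with `shlex.join` keeps group names with unusual characters from breaking the command line.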

We are thinking about introducing some generalization in dvc.yaml (might be something like loops). It would be great if you could show which solution would work best for you. It could be a great starting point for the generalization. Please share your thoughts.

jcpsantiago · August 18, 2020, 2:58pm

At the moment I have 3 models, but this will start growing fast.
More context: my current pipeline uses the drake package (I use R), which is also DAG-based like DVC, minus the storage backends and some other nice things. So far I’ve been keeping a models.yaml, which I use to track which models should exist and to hold some parameters needed to deploy everything to AWS SageMaker. I started using DVC to fetch and track the (common) training data, plus the /models dir where the serialized models live. But now I saw this https://www.youtube.com/watch?v=xPncjKH6SPk and I want those nice reports too.

I think my ideal solution would be some sort of looping mechanism, because all these models share the same structure (filenames, file structure, etc.); the only differences are the file paths and a dplyr::filter() step in the pipeline. The command is almost the same too: Rscript run_pipeline.R <group_name>.
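One way to approximate such a loop outside of DVC is to generate the dvc.yaml stages from a list of groups. This is a minimal sketch, not a DVC feature: the group names, the `train_<group>` stage names, and the `render_dvc_yaml` helper are hypothetical. The output paths follow the repo layout shown earlier in the thread, and the shared training data dependency is left out because its path is not given.

```python
from typing import List

GROUPS = ["group1", "group2", "group3"]  # illustrative group names

# One stage per group; only the group name changes between stages.
# The training data dependency is omitted because its path is not
# stated in the thread.
STAGE_TEMPLATE = """\
  train_{group}:
    cmd: Rscript run_pipeline.R {group}
    deps:
      - run_pipeline.R
    outs:
      - models/model_{group}/model.xgb
    metrics:
      - metrics/model_{group}/model_eval.json
"""


def render_dvc_yaml(groups: List[str]) -> str:
    """Render a dvc.yaml document with one training stage per group."""
    stages = "".join(STAGE_TEMPLATE.format(group=g) for g in groups)
    return "stages:\n" + stages


if __name__ == "__main__":
    # Print the generated file; redirect to dvc.yaml once it looks right.
    print(render_dvc_yaml(GROUPS), end="")
```

The list of groups stays in one place, and the generated file can be committed like a hand-written one.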

Doing dvc repro with a lot of variants must be very costly though.

I’ll try adding separate commands for each and see what I don’t like in it and report back.

dmitry · August 18, 2020, 3:10pm

Thank you for the detailed explanation!

I’d really appreciate it if you try and share your thoughts. It might give us good ideas before jumping to implementation.

jcpsantiago · August 19, 2020, 5:30pm

@dmitry I tried it and I’m not a fan. There’s too much duplication in the yaml file, and I feel it’s mixing two things: the pipeline/DAG and the parameters needed to run it. These could be nicely decoupled.

Here’s what I would like to be able to do:

dvc_params:
  group1:
    name: "group 1"
    something_else: "foo"
  group2:
    name: "group 2"
    something_else: "meh"
  group3:
    name: "group 3"
    something_else: "gwah"

stages:
  get_some_data:
    cmd: ./get_data.sh {{dvc_params.name}}
    deps:
      - get_data.sh
      - query.sql
    outs:
      - data.csv
  train_model:
    cmd: Rscript train.R {{dvc_params.name}} {{dvc_params.something_else}}
    deps:
      - data.csv
    outs:
      - models/{{dvc_params.name}}/model.xgb
    metrics:
      - metrics/{{dvc_params.name}}/model_eval.json

DVC would loop the dvc_params and run the DAG replacing the values in {{ }}.
Now one could ask whether you loop through the DAG stage by stage, or run the whole DAG for each element of dvc_params. I don’t have an opinion at the moment.
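The substitution step could be sketched roughly like this, reading the proposal as "expand each templated string once per entry of dvc_params". This is not how DVC works or an existing DVC API; the `PLACEHOLDER` pattern and the `expand` and `expand_for_all` helpers are hypothetical names used only to illustrate the idea.

```python
import re
from typing import Dict

# The dvc_params block from the proposal above.
DVC_PARAMS: Dict[str, Dict[str, str]] = {
    "group1": {"name": "group 1", "something_else": "foo"},
    "group2": {"name": "group 2", "something_else": "meh"},
    "group3": {"name": "group 3", "something_else": "gwah"},
}

# Matches {{dvc_params.<key>}} and captures <key>.
PLACEHOLDER = re.compile(r"\{\{\s*dvc_params\.(\w+)\s*\}\}")


def expand(template: str, values: Dict[str, str]) -> str:
    """Replace every {{dvc_params.<key>}} with values[<key>].

    Raises KeyError if the template names a key the group does not define.
    """
    return PLACEHOLDER.sub(lambda m: values[m.group(1)], template)


def expand_for_all(template: str, params: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Expand one templated string once per group."""
    return {group: expand(template, values) for group, values in params.items()}


if __name__ == "__main__":
    cmd = "Rscript train.R {{dvc_params.name}} {{dvc_params.something_else}}"
    for group, line in expand_for_all(cmd, DVC_PARAMS).items():
        print(group, "->", line)
```

Applying `expand_for_all` to every `cmd`, `deps`, `outs`, and `metrics` entry would correspond to running the whole DAG once per group; looping stage by stage would apply the same substitution, only in a different order.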

I hope this illustrates my thoughts a bit better.

dmitry · August 19, 2020, 7:59pm

@jcpsantiago that’s super helpful! The scenario makes sense; we just need to work out the details, such as the loop syntax.

There is an open issue with a great discussion around parameters: https://github.com/iterative/dvc/issues/3633
I’d appreciate it if you can jump in and bring your example to the discussion. It will give the loop/deduplication perspective to the variable problem. So, we can come up with a holistic approach.

jcpsantiago · August 20, 2020, 6:36am

Done! I’m continuing the discussion in the github issue.

dmitry · August 20, 2020, 5:56pm

Thank you @jcpsantiago! Very much appreciated.

The major question for now is what priority the issue has and when it will be implemented. I’ll talk with the team about this. All the design questions can be answered during the implementation, in the same issue discussion.

jcpsantiago · August 21, 2020, 7:02am

If I can help further, e.g. by writing more user stories, please let me know.
