140-stage DVC pipeline getting hard to work with

Hello there! I have been using DVC in my team for more than a year now and we are quite happy with it. We use it solely for data and model versioning.

We currently maintain more than 30 deep learning models, resulting in 140 DVC stages. Because many of the stages are quite similar, we defined the dvc.yaml file as follows:

stages:
  create_dataset:
    foreach: ${create_dataset_list}
    do:
      cmd:
      - python ../../libs/stats/push_pull_datasets.py pull ../../datasets/${item.vg}/stats/${item.stat}/${item.dataset_yaml} ../../artifacts/${item.vg}/stats/${item.stat}/${item.folder_images}
        --ssh ${matrix} --force
      - python ../../scripts/stats/create_dataset/${item.script} ../../datasets/${item.vg}/stats/${item.stat}/${item.dataset_yaml} ../../artifacts/${item.vg}/stats/${item.stat}/${item.folder_images}
        --params htc/training/tweaks/${item.vg}/stats/${item.stat}/${item.training_tweaks} --output ../../artifacts/${item.vg}/stats/${item.stat}/${item.output}
      deps:
      - ../../scripts/stats/create_dataset/${item.script}
      - ../../datasets/${item.vg}/stats/${item.stat}/${item.dataset_yaml}
      - ../../training/tweaks/${item.vg}/stats/${item.stat}/${item.training_tweaks}
      outs:
      - ../../artifacts/${item.vg}/stats/${item.stat}/${item.output}

  train_model:
     ...

  finetune_model:
     ... 

  compute_metrics:
     ...

The pipeline is powered by the params.yaml file, which defines the parameters of every stage:

create_dataset_list:

- vg: league-of-legends
  stat: champions
  script: standard_dataset_creation.py
  dataset_yaml: trainset.yaml
  folder_images: trainset_images
  training_tweaks: create_trainset.py
  output: trainset.h5

- ... # on and on, we have a total of 140 stage parametrisation

While this way of doing things works, it is becoming inconvenient to work with:

  • the params.yaml file is 1300 lines long, and each model's parametrisation is scattered across the different stage lists (e.g., the create_dataset parametrisation of model 1 is around line 1, while its train_model parametrisation is around line 750, so they are nowhere near each other). Ideally I would like to split it into one file per model (e.g., params_model1.yaml, params_model2.yaml, etc.)
  • it is hard to run dvc repro on a single stage, as the stage is named create_dataset@x where x is a number. To find the right value of x I have to dig through the dvc.lock file, which is super tedious. Ideally I would like to run dvc repro -s create_dataset@model1_trainset

I think one way to do it would be to remove the dynamic instantiation of the stages and simply write the stages with proper names in separate files. However, that would add a lot of code duplication, and declaring a new stage would become harder. Right now I only have to add a new block to the params.yaml file and that's it.
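Another hack I have considered (a sketch only; merge_params is a hypothetical helper script of mine, not a DVC feature, and the field values are illustrative): keep one params file per model and regenerate params.yaml from them before running dvc repro. The merge itself is trivial once each per-model file has been loaded into a dict:

```python
# Hypothetical pre-processing step: rebuild the global params structure from
# per-model dicts (as they would be loaded from params_model1.yaml, etc.).
# Not a DVC feature -- just a sketch of the merge logic.

def merge_params(per_model: dict) -> dict:
    """Append each model's stage entry to the corresponding global stage list."""
    merged: dict = {}
    for model_name, stage_entries in per_model.items():
        for stage_key, entry in stage_entries.items():
            merged.setdefault(stage_key, []).append(entry)
    return merged

# Illustrative input: two models, each with a create_dataset parametrisation.
per_model = {
    "model_1": {"create_dataset_list": {"vg": "league-of-legends", "stat": "champions"}},
    "model_2": {"create_dataset_list": {"vg": "other-game", "stat": "other-stat"}},
}

params = merge_params(per_model)
# params["create_dataset_list"] is now a two-element list, one entry per model
```

The downside is an extra generation step before every dvc repro, which is why I would prefer a native solution.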

But maybe there is another way I am not seeing? I would love to have inputs on this question from you.

Thank you very much :slight_smile:

I have found a way to somewhat achieve what I want, but it is not fully working yet. Maybe it is worth sharing here as a starting point.

Regarding the naming of my stages: in the params.yaml file, instead of having a list of stage entries, I can use a dict like this:

create_dataset_list:
  model_1:
    vg: league-of-legends
    stat: champions
    script: standard_dataset_creation.py
    dataset_yaml: trainset.yaml
    folder_images: trainset_images
    training_tweaks: create_trainset.py
    output: trainset.h5

This seems to do the trick, as the named stage now appears in the output of dvc status.
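If that holds, reproducing a single stage becomes straightforward (assuming the key-based naming works as it appears to; the stage name is now derived from the dict key rather than a numeric index):

```shell
# With a dict, the foreach stage name uses the key instead of a numeric index
dvc repro -s create_dataset@model_1    # -s / --single-item: only this stage
dvc status create_dataset@model_1
```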

As for splitting the params.yaml file and grouping the parametrisations by model, I found the following approach. In dvc.yaml I can add:

vars:
  - model_1_params.yaml

Variables can then be resolved from this file in addition to the default params.yaml; the idea is to add one such file per model. DVC does seem to read the file, but I am now running into a problem: if I format my model_1_params.yaml file as follows, it throws the error cannot redefine 'create_dataset_list' from 'model_1_params.yaml' as it already exists in 'params.yaml'.

# model_1_params.yaml
create_dataset_list:
  model_1:
    vg: league-of-legends
    stat: champions
    script: standard_dataset_creation.py
    dataset_yaml: trainset.yaml
    folder_images: trainset_images
    training_tweaks: create_trainset.py
    output: trainset.h5
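One direction I might explore (an untested guess on my part, not confirmed DVC behaviour): since the error complains about the same top-level key living in two files, giving each vars file its own distinct key might avoid the clash, at the cost of one foreach block per model in dvc.yaml:

```yaml
# model_1_params.yaml -- hypothetical layout with a distinct top-level key
model_1_create_dataset:
  trainset:
    vg: league-of-legends
    stat: champions
    # ... remaining fields as above
```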

I feel I am close to having a working solution. Do you have any idea?

Hi @dotXem , I hope the suggestion doesn’t disrupt your current initiatives too much, but your use case sounds like a good fit for Hydra.
Consider taking a quick look at Hydra Composition | Data Version Control · DVC, and at Hydra itself, to see if it makes sense for you.
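A minimal sketch of what that setup could look like (layout per the DVC Hydra integration; treat the concrete group and file names as illustrative and double-check against the linked docs):

```yaml
# conf/config.yaml -- composed into params.yaml at `dvc repro` time once
# Hydra support is enabled with `dvc config hydra.enabled true`.
defaults:
  - create_dataset: model_1   # picks conf/create_dataset/model_1.yaml
```

Each model's parametrisation then lives in its own file under conf/, which addresses both the per-model split and the 1300-line params.yaml.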