Hello there! I have been using DVC in my team for more than a year now and we are quite happy with it. We use it solely for data and model versioning.
We currently maintain more than 30 deep learning models, resulting in 140 DVC stages. Because many of the stages are quite similar, we defined the dvc.yaml file as follows:
    stages:
      create_dataset:
        foreach: ${create_dataset_list}
        do:
          cmd:
            - python ../../libs/stats/push_pull_datasets.py pull ../../datasets/${item.vg}/stats/${item.stat}/${item.dataset_yaml} ../../artifacts/${item.vg}/stats/${item.stat}/${item.folder_images}
              --ssh ${matrix} --force
            - python ../../scripts/stats/create_dataset/${item.script} ../../datasets/${item.vg}/stats/${item.stat}/${item.dataset_yaml} ../../artifacts/${item.vg}/stats/${item.stat}/${item.folder_images}
              --params htc/training/tweaks/${item.vg}/stats/${item.stat}/${item.training_tweaks} --output ../../artifacts/${item.vg}/stats/${item.stat}/${item.output}
          deps:
            - ../../scripts/stats/create_dataset/${item.script}
            - ../../datasets/${item.vg}/stats/${item.stat}/${item.dataset_yaml}
            - ../../training/tweaks/${item.vg}/stats/${item.stat}/${item.training_tweaks}
          outs:
            - ../../artifacts/${item.vg}/stats/${item.stat}/${item.output}
      train_model:
        ...
      finetune_model:
        ...
      compute_metrics:
        ...
This is driven by the params.yaml file, which defines the parameters of every stage:
    create_dataset_list:
      - vg: league-of-legends
        stat: champions
        script: standard_dataset_creation.py
        dataset_yaml: trainset.yaml
        folder_images: trainset_images
        training_tweaks: create_trainset.py
        output: trainset.h5
      - ... # on and on, we have a total of 140 stage parametrisation
While this approach works, it is becoming inconvenient to work with:
- The params.yaml file is 1300 lines long, and the parametrisation of each model's pipeline is scattered across the per-stage lists (e.g., the create_dataset parametrisation of model 1 is at line 1 and its train_model parametrisation is at line 750, so they are far apart). Ideally, I would like to split it per model (e.g., params_model1.yaml, params_model2.yaml, etc.).
- It is hard to run dvc repro on a single stage, because the stage is named like create_dataset@x, where x is a number. To find the right value of x, I need to go to the dvc.lock file and look for the right entry, which is super tedious. Ideally, I would like to run dvc repro -s create_dataset@model1_trainset.
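To make the second point concrete, here is a minimal sketch of the lookup I currently do by hand. It assumes the generated stages are numbered by their zero-based position in the list, and that the list has already been parsed into Python dicts. The function name find_stage_names and the testset.h5 entry are made up for illustration.

```python
def find_stage_names(entries, *, stage="create_dataset", **wanted):
    """Return the generated stage names whose entry matches every given field.

    Assumption: a foreach over a list of mappings names its stages
    "<stage>@<zero-based index>".
    """
    return [
        f"{stage}@{index}"
        for index, entry in enumerate(entries)
        if all(entry.get(field) == value for field, value in wanted.items())
    ]


# Trimmed-down stand-in for create_dataset_list in params.yaml.
# The second entry (testset.h5) is hypothetical.
create_dataset_list = [
    {"vg": "league-of-legends", "stat": "champions", "output": "trainset.h5"},
    {"vg": "league-of-legends", "stat": "champions", "output": "testset.h5"},
]

print(find_stage_names(create_dataset_list, output="testset.h5"))
# prints ['create_dataset@1']
```

This only automates the search for x. It does not fix the underlying problem that the number changes whenever an entry is inserted earlier in the list.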
I think one way to do it would be to drop the dynamic instantiation of the stages and simply write each stage out with an explicit name, in separate files. However, that would add a lot of code duplication, and declaring a new stage would become harder. Right now I only have to add a new block to the params.yaml file and that's it.
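A possible middle ground, sketched under assumptions: as far as I understand, foreach can also iterate over a mapping instead of a list, in which case the generated stages are named after the keys (create_dataset@<key>) rather than a position. If so, the existing list could be converted mechanically into a mapping with descriptive keys. The function to_named_mapping and the key scheme vg_stat_output-stem below are hypothetical choices for illustration, not something DVC prescribes.

```python
from pathlib import PurePosixPath


def to_named_mapping(entries):
    """Turn a list of stage parametrisations into a mapping keyed by a readable name.

    The key scheme (<vg>_<stat>_<output without extension>) is an assumption;
    any scheme that gives unique keys would do.
    """
    named = {}
    for entry in entries:
        stem = PurePosixPath(entry["output"]).stem
        key = f"{entry['vg']}_{entry['stat']}_{stem}"
        if key in named:
            # Two entries mapping to the same key would silently overwrite each other.
            raise ValueError(f"duplicate stage key: {key}")
        named[key] = entry
    return named


# Trimmed-down stand-in for create_dataset_list in params.yaml.
create_dataset_list = [
    {"vg": "league-of-legends", "stat": "champions", "output": "trainset.h5"},
]

named = to_named_mapping(create_dataset_list)
print(list(named))
# prints ['league-of-legends_champions_trainset']
```

With the mapping written back to params.yaml, the ${item.vg}-style references in dvc.yaml would stay the same, and the stage would be addressed by its key instead of a number.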
But maybe there is another way I am not seeing? I would love to hear your input on this.
Thank you very much