In my current project it is very important to run leave-one-out experiments, to test whether our trained networks are applicable to previously unknown datasets.
For this I have created the following DVC pipeline.
<params.yaml>
train_datasets:
- abc
- efg
test_datasets:
- xyz
<dvc.yaml>
stages:
  prepare_train_data:
    matrix:
      dataset: ${train_datasets}
    cmd: ...
    outs:
      - train_data/${item.dataset}
  train_model:
    deps:
      - train_data
    cmd: ...
    outs:
      - saved_model
  test_model:
    matrix:
      dataset: ${test_datasets}
    deps:
      - saved_model
    cmd: ...
    outs: ...
In practice, preparation and testing consist of more substeps, but this should suffice for now.
The set of train_datasets changes regularly: single datasets are added or removed. Since data preparation is time-consuming, restoring the individual preparations for each dataset from the cache is extremely helpful.
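For example (a hypothetical change, reusing the datasets from the params.yaml above), one leave-one-out round might drop “efg” from the training set:
train_datasets:
- abc    # "efg" was removed for this round
test_datasets:
- xyz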
With this simple pipeline I faced the following challenges:
Challenge 1:
If I remove a dataset from “train_datasets”, the prepared outputs of that dataset remain in the “train_data” folder, because the corresponding templated stage is no longer run and thus never removes its outputs. As a result, the hash of the “train_data” folder changes and model training is rerun unnecessarily.
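Concretely (a sketch, continuing the hypothetical removal of “efg” above), the stale preparation is still in the workspace after the next repro:
dvc repro
ls train_data
# abc  efg    <- efg is no longer produced by any stage, but its old output remains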
Thus I introduced an additional “clear_train_data” stage at the beginning of my pipeline:
clear_train_data:
  cmd: rm -rf train_data
  always_changed: true
This works fine as long as I use a simple “dvc repro” call to execute my pipeline.
Challenge 2:
However, if I provide a target, e.g.
dvc repro train_model
the “clear_train_data” stage is no longer executed, as it is not tied to the DVC DAG of the train_model stage.
(I guess this is intended, although I have to admit that the documentation is not explicit here. At first I understood that --always-changed stages are executed upon any dvc repro call.)
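To summarize the behaviour I observe (as I understand it):
dvc repro               # runs all stages, including clear_train_data (always_changed)
dvc repro train_model   # runs only train_model and its upstream stages; clear_train_data is skipped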
So I adjusted my “clear_train_data” stage to attach it to the DVC DAG:
clear_train_data:
  cmd: rm -rf train_data && mkdir train_data && touch train_data/cleaned.txt
  outs:
    - train_data/cleaned.txt
  always_changed: true
I do not like this solution, but at least now “clear_train_data” is run even if I use “dvc repro train_model”.
Challenge 3:
Even though “clear_train_data” is the first stage of my pipeline, it is not always executed first upon “dvc repro [targets]”. So prepared train data might be added to the “train_data” folder and only later removed by the “clear_train_data” stage.
Therefore I added another artificial dependency to prepare_train_data:
prepare_train_data:
  matrix:
    dataset: ${train_datasets}
  deps:
    - train_data/cleaned.txt
  cmd: ...
  outs:
    - train_data/${item.dataset}
This seems to work for now, but it is clearly not a straightforward solution.
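For reference, with all three workarounds in place my <dvc.yaml> now looks roughly like this (cmd entries still omitted):
stages:
  clear_train_data:
    cmd: rm -rf train_data && mkdir train_data && touch train_data/cleaned.txt
    outs:
      - train_data/cleaned.txt
    always_changed: true
  prepare_train_data:
    matrix:
      dataset: ${train_datasets}
    deps:
      - train_data/cleaned.txt
    cmd: ...
    outs:
      - train_data/${item.dataset}
  train_model:
    deps:
      - train_data
    cmd: ...
    outs:
      - saved_model
  test_model:
    matrix:
      dataset: ${test_datasets}
    deps:
      - saved_model
    cmd: ...
    outs: ...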
Thus I would like to ask whether you have any other suggestions on how to set up my pipeline?