Challenges with a non-standard DVC pipeline

In my current project it is very important to run leave-one-out experiments, to test whether our trained networks generalize to previously unknown datasets.

For this I have created the following DVC pipeline.

<params.yaml>

train_datasets:
  - abc
  - efg

test_datasets:
  - xyz

<dvc.yaml>

prepare_train_data:
  matrix:
    dataset: ${train_datasets}
  cmd: ...
  outs:
    - train_data/${item.dataset}

train_model:
  deps:
    - train_data
  cmd: ...
  outs:
    - saved_model

test_model:
  matrix:
    dataset: ${test_datasets}
  deps:
    - saved_model
  cmd: ...
  outs: ...

In practice, preparation and testing consist of more substeps, but this should suffice for now.
The set of train_datasets also changes regularly: single datasets are added or removed. Since data preparation is time-consuming, being able to restore the individual per-dataset preparations from the cache is extremely helpful.
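As far as I understand, DVC expands such a matrix stage into one sub-stage per dataset, addressable as prepare_train_data@<dataset>, so a single preparation can be reproduced or restored from cache without touching the others, e.g.:

dvc repro prepare_train_data@abc    # prepare only the abc dataset
dvc repro prepare_train_data@efg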

With this simple pipeline I faced the following challenges:

Challenge 1:
If I remove a dataset from “train_datasets”, the outputs of data preparation remain in the “train_data” folder: the corresponding templated stage no longer exists, so its outputs are never removed. As a result, the hash of the “train_data” folder changes and model training is rerun unnecessarily.
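To illustrate with the dataset names from params.yaml above, after removing efg from train_datasets the workspace looks roughly like this:

train_data/
  abc/    # still produced by prepare_train_data@abc
  efg/    # stale: no stage produces it anymore, yet it still affects the folder hash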

Thus I introduced an additional “clear_train_data” stage at the beginning of my pipeline:

clear_train_data:
  cmd: rm -rf train_data
  always_changed: true

This works fine as long as I use a simple “dvc repro” call to execute my pipeline.

Challenge 2:
However, if I provide a target:

dvc repro train_model

The “clear_train_data” stage is no longer executed, as it is not tied to the DVC DAG of the train_model stage.
(I guess this is intended, although I have to admit that the documentation is not explicit here. At first I understood that --always-changed stages are executed upon any dvc repro call.)
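As far as I can tell, a targeted repro only considers the target and its upstream stages, roughly:

dvc repro train_model
# considered: prepare_train_data@abc, prepare_train_data@efg (via the train_data dependency) and train_model
# skipped:    clear_train_data, since it produces nothing that train_model depends on,
#             so always_changed has no effect for this call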

So I adjusted my clear_train_data stage to attach it to the DVC DAG:

clear_train_data:
  cmd: rm -rf train_data && mkdir train_data && touch train_data/cleaned.txt
  outs:
    - train_data/cleaned.txt
  always_changed: true

I do not like this solution, but since cleaned.txt lives inside the train_data directory that train_model depends on, “clear_train_data” is now part of the DAG and runs even when I use “dvc repro train_model”.

Challenge 3:
Even though “clear_train_data” is the first stage of my pipeline, it is not always executed first upon “dvc repro [targets]”. So train data might be added to the “train_data” folder and only afterwards wiped by the “clear_train_data” stage.

So I added another artificial dependency to prepare_train_data:

prepare_train_data:
  matrix:
    dataset: ${train_datasets}
  deps:
    - train_data/cleaned.txt
  cmd: ...
  outs:
    - train_data/${item.dataset}

This seems to work for now, but it is clearly not a straightforward solution.
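For completeness, the dependency chain these workarounds produce (dvc dag should show the same picture) is roughly:

clear_train_data
  -> prepare_train_data@abc, prepare_train_data@efg   # via train_data/cleaned.txt
  -> train_model                                      # via the train_data directory
  -> test_model@xyz                                    # via saved_model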

Thus I would like to ask: do you have any other suggestions on how to set up my pipeline?

As you mention, it’s not straightforward, but it makes sense for your setup.

Thank you very much for the feedback. I greatly appreciate the confirmation from a more experienced developer regarding this pipeline configuration.