In my current project it is very important to run leave-one-out experiments, to test whether our trained networks are applicable to previously unknown datasets.
For this I have created the following DVC pipeline.
<params.yaml>
train_datasets:
- abc
- efg
test_datasets:
- xyz
<dvc.yaml>
stages:
  prepare_train_data:
    matrix:
      dataset: ${train_datasets}
    cmd: ...
    outs:
      - train_data/${item.dataset}
  train_model:
    deps:
      - train_data
    cmd: ...
    outs:
      - saved_model
  test_model:
    matrix:
      dataset: ${test_datasets}
    deps:
      - saved_model
    cmd: ...
    outs: ...
In practice, preparation and testing consist of more substeps, but this should suffice for now.
The set of train_datasets changes regularly: single datasets are added or removed. Since data preparation is time-consuming, restoring the individual preparations for each dataset from the cache is extremely helpful.
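For example (a hypothetical change, reusing the datasets from the params.yaml above), one leave-one-out round might drop “efg” from the training set:
train_datasets:
- abc    # "efg" was removed for this round
test_datasets:
- xyz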
With this simple pipeline I faced the following challenges:
Challenge 1:
If I remove a dataset from “train_datasets”, the prepared outputs of that dataset remain in the “train_data” folder, because the corresponding templated stage is no longer run and thus never removes its outputs. As a result, the hash of the “train_data” folder changes and model training is rerun unnecessarily.
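Concretely (a sketch, continuing the hypothetical removal of “efg” above), the stale preparation is still in the workspace after the next repro:
dvc repro
ls train_data
# abc  efg    <- efg is no longer produced by any stage, but its old output remains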
Thus I introduced an additional “clear_train_data” stage at the beginning of my pipeline:
clear_train_data:
  cmd: rm -rf train_data
  always_changed: true
This works fine as long as I use a simple “dvc repro” call to execute my pipeline.
Challenge 2:
However, if I provide a target, e.g.
dvc repro train_model
the “clear_train_data” stage is no longer executed, as it is not tied to the DVC DAG of the train_model stage.
(I guess this is intended, although I have to admit that the documentation is not explicit here. At first I understood that --always-changed stages are executed upon any dvc repro call.)
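To summarize the behaviour I observe (as I understand it):
dvc repro               # runs all stages, including clear_train_data (always_changed)
dvc repro train_model   # runs only train_model and its upstream stages; clear_train_data is skipped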
So I adjusted my “clear_train_data” stage to attach it to the DVC DAG:
clear_train_data:
  cmd: rm -rf train_data && mkdir train_data && touch train_data/cleaned.txt
  outs:
    - train_data/cleaned.txt
  always_changed: true
I do not like this solution, but at least now “clear_train_data” is run even if I use “dvc repro train_model”.
Challenge 3:
Even though “clear_train_data” is the first stage of my pipeline, it is not always executed first upon “dvc repro [targets]”. So prepared train data might be added to the “train_data” folder and only later removed by the “clear_train_data” stage.
Therefore I added another artificial dependency to prepare_train_data:
prepare_train_data:
  matrix:
    dataset: ${train_datasets}
  deps:
    - train_data/cleaned.txt
  cmd: ...
  outs:
    - train_data/${item.dataset}
This seems to work for now, but it is clearly not a straightforward solution.
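For reference, with all three workarounds in place my <dvc.yaml> now looks roughly like this (cmd entries still omitted):
stages:
  clear_train_data:
    cmd: rm -rf train_data && mkdir train_data && touch train_data/cleaned.txt
    outs:
      - train_data/cleaned.txt
    always_changed: true
  prepare_train_data:
    matrix:
      dataset: ${train_datasets}
    deps:
      - train_data/cleaned.txt
    cmd: ...
    outs:
      - train_data/${item.dataset}
  train_model:
    deps:
      - train_data
    cmd: ...
    outs:
      - saved_model
  test_model:
    matrix:
      dataset: ${test_datasets}
    deps:
      - saved_model
    cmd: ...
    outs: ...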
Thus I would like to ask whether you have any other suggestions on how to set up my pipeline?