Same output for different pipelines

I want to have one dvclive folder for evaluating a dataset (for example dvclive_dataset_name_eval) that different pipelines can write to. The motivation is that I can have different pipelines for one dataset (for example, one with classical ML models and another with deep learning ones), while the eval metrics stay in the same format. That way, I can easily compare models from different pipelines. But DVC doesn’t allow this, because different stages can’t have the same output. Is there a way around that?

No, unfortunately there’s no way to have a DVC pipeline where multiple stages share the same output. One way to handle this would be to use different branches instead of different folders for the different models/pipelines.
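
For example (a rough sketch; the branch names and commit messages are just placeholders), each pipeline could live on its own branch, write to the same dvclive folder, and the results could then be compared across branches:

git checkout -b classic-ml
dvc repro                                     # writes dvclive_dataset_name_eval/
git add . && git commit -m "classic ml eval"
git checkout -b deep-learning main
dvc repro                                     # same folder, different branch
git add . && git commit -m "deep learning eval"
dvc metrics diff classic-ml deep-learning     # compare metrics between the branches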

What’s the reason for this limitation?

I would think my use case is quite common. Most of the data science pipelines I set up before dvc came along follow this pattern:

dvc.yaml

stages:
  fit_model:
    foreach: ${models_to_fit}
    do:
      cmd: python stages/fit_model.py "${key}"
      deps:
        - ${item.input_data_dir}
        - stages/fit_model.py
      params:
        - fit_model
      outs:
        - fitted_models/${key}/model_params.yaml
        - fitted_models/summary.parquet

I get the exception:

ERROR: output 'fitted_models/summary.parquet' is specified in:
        - fit_model@model1
        - fit_model@model2

In my case, summary.parquet is a dataframe containing the results of all the model-fitting experiments, i.e. each fitting step adds one result (row) to the dataframe.

Is there a fundamental reason why the DAG can only branch and not join?

I suppose you can just add another stage to do the summarization step:

  summarize_fit_results:
    cmd: python stages/summarize_fit_results.py
    deps:
      - fitted_models
      - stages/summarize_fit_results.py
    outs:
      - results/fit_results_summary.parquet

But it’s not ideal, because you have to duplicate all the contents of fit_results_summary.parquet across the fitted_models subfolders.

I’m still curious why the common output file is not an option.

A common output is not allowed because it causes problems with the DAG. If you were to run a single model, DVC wouldn’t know how that summary file is expected to be updated (by adding or replacing one line in the file). This would require DVC to be aware of changes at a sub-file level, which is beyond the scope of DVC. As you note, the typical pattern is to have a summary stage at the end.
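
To make the pattern concrete, here is a minimal sketch of what the summary stage’s script could look like, assuming each fit stage writes its own one-row result file to fitted_models/<key>/result.parquet (that path is just an illustrative convention, not something DVC requires):

# stages/summarize_fit_results.py -- minimal sketch
# Assumes each fit stage wrote a one-row fitted_models/<key>/result.parquet
from pathlib import Path

import pandas as pd

# Collect every per-model result file and concatenate them into one dataframe
result_files = sorted(Path("fitted_models").glob("*/result.parquet"))
summary = pd.concat([pd.read_parquet(p) for p in result_files], ignore_index=True)

# Write the combined summary as the single output of this stage
Path("results").mkdir(exist_ok=True)
summary.to_parquet("results/fit_results_summary.parquet", index=False)

With this layout each fit stage only owns its own per-model file, and the combined dataframe exists only in the summary stage’s output, so nothing has to be duplicated across the fitted_models subfolders.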

Why does DVC need to know how the file was updated? Surely it is sufficient to know that it could be updated? I thought DVC simply checks whether an output file has changed or not.

DVC expects each stage to work independently, and it checks whether a stage’s dependencies or outputs have changed to determine whether to re-run it. If the summary has changed, DVC doesn’t know which models need to be re-run. And if a single model is re-run, it will change the summary; since the summary would also be recorded as an output of every other model stage, all of those stages would be considered changed and have to re-run, which would change the summary yet again. This is not only suboptimal but also circular.

If you always run all models together, you could consider condensing into a single stage. If you plan to run models individually, the downstream summary stage is the way to go.

I’m sorry, but I don’t follow what you’re saying. In my mind, only changes to dependencies determine whether a stage needs to be re-run.