Same output for different pipelines

I want to have one dvclive folder for evaluating a dataset (for example dvclive_dataset_name_eval) that different pipelines can write to. The motivation is that I can have different pipelines for one dataset (for example, one with classical ML models and another with deep learning ones), while the eval metrics stay in the same format. That way, I can easily compare models from different pipelines. But DVC doesn’t allow this, because different stages can’t have the same output. Is there a way around that?

No, unfortunately there’s no way to have a DVC pipeline where multiple stages share the same output. One way to handle this would be to use different branches instead of different folders for the different models/pipelines.
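
For example (a rough sketch; the branch names and commit messages are just placeholders), each pipeline could live on its own branch, write to the same dvclive folder, and the results could then be compared across branches:

git checkout -b classic-ml
dvc repro                                     # writes dvclive_dataset_name_eval/
git add . && git commit -m "classic ml eval"
git checkout -b deep-learning main
dvc repro                                     # same folder, different branch
git add . && git commit -m "deep learning eval"
dvc metrics diff classic-ml deep-learning     # compare metrics between the branches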

What’s the reason for this limitation?

I would think my use case is quite common. Most of the data science pipelines I set up before dvc came along follow this pattern:

dvc.yaml

stages:
  fit_model:
    foreach: ${models_to_fit}
    do:
      cmd: python stages/fit_model.py "${key}"
      deps:
        - ${item.input_data_dir}
        - stages/fit_model.py
      params:
        - fit_model
      outs:
        - fitted_models/${key}/model_params.yaml
        - fitted_models/summary.parquet

I get the exception:

ERROR: output 'fitted_models/summary.parquet' is specified in:
        - fit_model@model1
        - fit_model@model2

In my case, summary.parquet is a dataframe containing the results of all the model-fitting experiments, i.e. each fitting step adds one result (row) to the dataframe.

Is there a fundamental reason why the DAG can only branch and not join?

I suppose you can just add another stage to do the summarization step:

  summarize_fit_results:
    cmd: python stages/summarize_fit_results.py
    deps:
      - fitted_models
      - stages/summarize_fit_results.py
    outs:
      - results/fit_results_summary.parquet

But it’s not ideal, because you have to duplicate all the contents of fit_results_summary.parquet across the fitted_models subfolders.

I’m still curious why the common output file is not an option.

A common output is not allowed because it causes problems with the DAG. If you were to run a single model, DVC wouldn’t know how that summary file is expected to be updated (by adding or replacing one line in the file). This would require DVC to be aware of changes at a sub-file level, which is beyond the scope of DVC. As you note, the typical pattern is to have a summary stage at the end.
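
To make the pattern concrete, here is a minimal sketch of what the summary stage’s script could look like, assuming each fit stage writes its own one-row result file to fitted_models/<key>/result.parquet (that path is just an illustrative convention, not something DVC requires):

# stages/summarize_fit_results.py -- minimal sketch
# Assumes each fit stage wrote a one-row fitted_models/<key>/result.parquet
from pathlib import Path

import pandas as pd

# Collect every per-model result file and concatenate them into one dataframe
result_files = sorted(Path("fitted_models").glob("*/result.parquet"))
summary = pd.concat([pd.read_parquet(p) for p in result_files], ignore_index=True)

# Write the combined summary as the single output of this stage
Path("results").mkdir(exist_ok=True)
summary.to_parquet("results/fit_results_summary.parquet", index=False)

With this layout each fit stage only owns its own per-model file, and the combined dataframe exists only in the summary stage’s output, so nothing has to be duplicated across the fitted_models subfolders.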

Why does DVC need to know how the file was updated? Surely it is sufficient to know that it could be updated? I thought DVC simply checks whether an output file has changed or not.

DVC expects each stage to work independently, and it checks whether a stage’s dependencies or outputs have changed to determine whether to re-run it. If the summary has changed, DVC doesn’t know which models need to be re-run. And if a single model is re-run, it will change the summary; since the summary would also be recorded as an output of every other model stage, all of those stages would be considered changed and have to re-run, which would change the summary yet again. This is not only suboptimal but also circular.

If you always run all models together, you could consider condensing into a single stage. If you plan to run models individually, the downstream summary stage is the way to go.

I’m sorry, but I don’t follow what you’re saying. In my mind, only changes to dependencies determine whether a stage needs to be re-run.