Update same output dir in different stages

Hi, I have a pipeline with many stages that write the output in different subdirectories of the same dir.
So basically each stage has as input a dir like this:

  • input-dir
    • file1
    • file2
    • fileN

The nth stage writes its output:

  • output-dir (same for all the stages)
    • results-file1
      • output-stage1
      • output-stage2
      • output-stageN
    • results-file2
      .
      .
      .

Actually, results-file* is a Zarr group (see "Groups (zarr.hierarchy)" in the zarr 2.6.1 documentation), and each stage appends new arrays to it (these correspond to the output-stage* dirs).
Is there a way to achieve this with DVC? The dvc run command complains that the output is already specified in another stage.

Thanks in advance

Hi @mauro,

Your case is similar to a recent question which you can find here: Managing pipelines operating per dataset element. It lists some options to work around DVC's current limitation on overlapping outputs (with regular files/dirs).

In your case the problem is harder for DVC though, because we work at the file level and have no understanding of internal data formats. So DVC can’t see what’s inside results-file*, much less track only certain parts of that file per stage.

Is there perhaps any way to create a “view” of a Zarr file that the file system sees as regular dirs and files? Perhaps using symlinks? If so, that could be what DVC tracks (as a proxy for results-file*).

Related issues you may want to read and perhaps participate in:

Actually, results-file* and output-stage* are regular directories. Anyway, that is a specific detail of how Zarr stores array groups, so in order to stay agnostic of the backend used, I am considering the option of a summary file, as suggested in other discussions.
Are there any examples of how to use such a file as an alternative output of a stage?
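
A minimal sketch of what the summary-file idea could look like in dvc.yaml. Everything named here is an assumption for illustration: the scripts stage1.py and stage2.py, their --summary flag, and the summaries/ directory are hypothetical and do not come from this thread. Each script is assumed to write into the shared output-dir itself and then record what it wrote in a small summary file, which is the only output DVC is told about.

```yaml
# dvc.yaml (sketch; script names, flags and paths are hypothetical)
stages:
  stage1:
    # The script writes output-dir/results-file*/output-stage1 on its own,
    # then lists what it produced in summaries/stage1.json.
    cmd: python stage1.py input-dir output-dir --summary summaries/stage1.json
    deps:
      - stage1.py
      - input-dir
    outs:
      - summaries/stage1.json
  stage2:
    cmd: python stage2.py input-dir output-dir --summary summaries/stage2.json
    deps:
      - stage2.py
      - input-dir
      - summaries/stage1.json  # makes stage2 run after stage1
    outs:
      - summaries/stage2.json
```

Because no two stages declare the same output, the overlap complaint should not arise. The trade-off is that output-dir itself is not a declared output in this sketch, so DVC would neither cache nor restore its contents; only the summary files are tracked.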

@mauro, if they are separate outputs, we have support for foreach loops in the dvc.yaml file, though you have to provide the input and output names yourself. The feature is stable, but not released yet (it will be released in 2.0).

It looks something similar to the following:

stages:
  stage_name:
    foreach:
      - 1
      - 2
    do:
      # placeholder command: replace with the real one
      cmd: echo "processing file${item}"
      deps:
        # each iteration's value is available through the `${item}` expression
        - input-dir/file${item}
      outs:
        - output-dir/results-file1/output-stage${item}
        - output-dir/results-file2/output-stage${item}

Please take a look at “generating stages”. You can also keep the foreach data separately in the params.yaml file next to this dvc.yaml file; it will be read by default.
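
As a rough sketch of keeping the foreach data in params.yaml, the list could be moved there and referenced by name. The key stage_ids and the echo command are made up for this example; the paths are the ones used earlier in this thread.

```yaml
# params.yaml (the key name `stage_ids` is hypothetical)
stage_ids:
  - 1
  - 2
```

```yaml
# dvc.yaml, placed next to params.yaml
stages:
  stage_name:
    foreach: ${stage_ids}  # list is read from params.yaml
    do:
      # placeholder command: replace with the real one
      cmd: echo "processing file${item}"
      deps:
        - input-dir/file${item}
      outs:
        - output-dir/results-file1/output-stage${item}
        - output-dir/results-file2/output-stage${item}
```

With this layout DVC should generate one stage per list entry, named stage_name@1 and stage_name@2, so a single one can be reproduced with `dvc repro stage_name@1`.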

And, to install the unreleased version, please take a look at these installation instructions.

That’s a very interesting feature, thanks! I will try to find a solution with it.