Hello! I would like to understand the appropriate way to manage pipeline stages whose inputs and outputs are organized ("sliced") not stage-wise, but rather dataset-element-wise.
For example, suppose a medical imaging dataset comprises multiple subjects, each of whom has a collection of files associated with them in a subject-specific directory. This organization is convenient for manual inspection. Initially, only an image exists for each subject; it is located in S3 and obtained using dvc import-url. The to-be-written DVC pipeline will produce additional files (preprocessing outputs, etc.) in each subject's directory.
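For concreteness, the starting point looks roughly like this (the bucket, subject IDs, and file names below are made up for illustration):

```bash
# One raw image per subject, imported from S3 into that subject's directory.
# Bucket and paths are hypothetical.
dvc import-url s3://example-bucket/raw/sub-001/image.nii.gz data/sub-001/image.nii.gz
dvc import-url s3://example-bucket/raw/sub-002/image.nii.gz data/sub-002/image.nii.gz
# ... one import per subject
```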
My current understanding from DVC's (excellent) documentation is that there are several possible ways to accomplish this:
[a] Reshape the dataset such that subjects' files are distributed across per-stage directories, which are given as dep or out to the pipeline.
[b] List every individual file as a dep or out (a rough sketch of what I mean appears after this list).
[c] Decouple stages from directory structure by having each stage depend on a “summary” or “success” file from previous stages instead of a subject’s files or a stage directory.
[d] Introduce a “summary” stage which creates the desired directory structure view, e.g. via symlinks.
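To illustrate [b], here is a rough dvc.yaml sketch with one stage per subject and every file spelled out explicitly (the script, directory, and file names are invented for illustration):

```yaml
stages:
  preprocess_sub-001:
    cmd: python preprocess.py data/sub-001
    deps:
      - preprocess.py
      - data/sub-001/image.nii.gz
    outs:
      - data/sub-001/preprocessed.nii.gz
  preprocess_sub-002:
    cmd: python preprocess.py data/sub-002
    deps:
      - preprocess.py
      - data/sub-002/image.nii.gz
    outs:
      - data/sub-002/preprocessed.nii.gz
  # ...and so on for every subject and every subsequent stage
```

This keeps everything collocated in each subject's directory, but the file grows quickly with the number of subjects and stages.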
Option [a] discards the desired directory structure, which is highly undesirable: manual inspection of subjects' intermediate outputs is common, and it is convenient to have all of a subject's outputs collocated. Option [b] preserves the directory structure but is unwieldy and perhaps not performant(?). Option [c] preserves the directory structure but requires error-prone user implementation (e.g. the summary file must reflect the unique combination of files it summarizes, which somewhat reinvents what DVC already does when tracking a directory). Option [d] preserves the directory structure as a view of another structure that DVC handles more naturally, but produces outputs that should not themselves be tracked.
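To make the concern about [c] concrete, this is roughly what I imagine (the write_summary.py helper, the .summaries/ location, and the other names are hypothetical): each stage works on the subject directories in place, fingerprints the files it produced into a small summary file, and the next stage depends only on that summary file.

```yaml
stages:
  preprocess:
    # The per-subject files themselves are deliberately not listed as deps/outs;
    # only the summary files link the stages together.
    cmd: python preprocess.py data/ && python write_summary.py data/ .summaries/preprocess.json
    deps:
      - preprocess.py
      - .summaries/import.json   # written by a hypothetical earlier step that fingerprints the imported images
    outs:
      - .summaries/preprocess.json
  segment:
    cmd: python segment.py data/ && python write_summary.py data/ .summaries/segment.json
    deps:
      - segment.py
      - .summaries/preprocess.json
    outs:
      - .summaries/segment.json
```

Getting write_summary.py right (so that the summary really changes whenever any of the underlying per-subject files change) is exactly the part that feels like reimplementing DVC's own directory tracking.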
Are there perhaps other options I have missed? Thank you!