Hello! I would like to understand the appropriate way to manage pipeline stages that are not “sliced” stage-wise, but rather dataset element-wise.
For example, suppose a medical imaging dataset comprises different subjects, each of whom has a collection of files associated with them in a subject-specific directory. Such organization is convenient for manual inspection. Initially, only an image exists for each patient, which is located in S3 and obtained using dvc import-url. The to-be-written DVC pipeline will produce additional files (preprocessing, etc.) in each subject’s directory.
My current understanding from DVC’s (excellent) documentation is that there are potentially several ways to accomplish this:
[a] Reshape the dataset such that subjects are distributed across multiple stage directories, which are given as dep or out to the pipeline.
[b] List every file as dep or out.
[c] Decouple stages from directory structure by having each stage depend on a “summary” or “success” file from previous stages instead of a subject’s files or a stage directory.
[d] Introduce a “summary” stage which creates the desired directory structure view, e.g. via symlinks.
Option [a] discards the desired directory structure, which is extremely undesirable since manual inspection of subjects’ intermediate outputs is common and it is convenient to have all outputs collocated. Option [b] preserves the desired directory structure but is unwieldy and perhaps not performant(?). Option [c] preserves the desired directory structure but requires an error-prone user implementation (e.g. the summary file must reflect the unique combination of files it summarizes, which somewhat reinvents what DVC already does for tracking a directory). Option [d] preserves the desired directory structure as a view of another structure DVC likes more, but produces outputs that should not be tracked.
Are there perhaps other options possible? Thank you!
Well, I’m not sure what you mean by “stages that are not sliced stage-wise”. Do you mean most stages use the same dependency (base image) — and could happen in parallel? Because that is OK, pipelines don’t need to be serial, any graph is supported (examples here and here).
But please lmk if you had a different impression from the docs! We could def. improve that.
In any case, yes: because DVC doesn’t support overlapping output paths (here’s why), stages can’t share an output directory as a whole.
The simplest workarounds for that are options [a] and [b], as you already identified. But [b] becomes undesirable as the number of files grows, which I guess is how you came up with workarounds [c] and [d] (great ideas BTW). I’m not sure about [d] though, since DVC deletes output files every time repro starts, and already uses links (depending on your file system and cache.type config).
That leaves us choosing between:
[a] change your project’s structure (but you don’t want that); or
[c] generate an intermediate file listing what the next stage should read (which would be the formal dep of that stage)
The notion of stage-wise vs element-wise is an allusion to DVC’s view of data as either a collection of files or a single opaque blob (the containing directory). There doesn’t seem (yet?) to be support for any intermediate view. For example, one cannot track a directory and list a file in that directory as a dep or ‘out’(?). And indeed, the difficulty of overlapping output paths is clear (as you noted in this github issue).
I generally agree with your assessments of the different options.
For option [c], I guess one gotcha is that the list must also contain hashes of the files.
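To make the gotcha concrete, here is a minimal sketch of such a summary file, assuming a per-subject directory and a file name pattern chosen by the stage. The function names, the JSON format, and the use of SHA-256 are my own assumptions for illustration; this does not reproduce how DVC hashes a directory internally.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(subject_dir: Path, pattern: str) -> dict:
    """Map each matching file (path relative to subject_dir) to its content hash."""
    return {
        path.relative_to(subject_dir).as_posix(): file_sha256(path)
        for path in sorted(subject_dir.rglob(pattern))
        if path.is_file()
    }


def write_manifest(subject_dir: Path, pattern: str, manifest_path: Path) -> None:
    """Write the manifest as sorted JSON so identical inputs give identical bytes."""
    manifest = build_manifest(subject_dir, pattern)
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True) + "\n")
```

Because the manifest changes whenever any listed file changes, a downstream stage that lists only the manifest as its dep would still be re-run when the underlying files change.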
Regarding [d], I imagined that the pipeline would be constructed as deps between directories (one of the preferred DVC use cases seen in the docs), and that the final “summary” stage would depend on all previous stages’ directories and create a collated view of the content therein, according to the desired directory structure.
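A rough sketch of what that “summary” stage could do, assuming the tracked layout is stage_root/<stage>/<subject>/<file> and the desired view is view_root/<subject>/<stage>/<file>. Both layouts and the function name are hypothetical, and the view directory itself would presumably be left untracked.

```python
import os
from pathlib import Path


def build_subject_view(stage_root: Path, view_root: Path) -> list:
    """Symlink every stage_root/<stage>/<subject>/<file> into view_root/<subject>/<stage>/."""
    created = []
    for stage_dir in sorted(p for p in stage_root.iterdir() if p.is_dir()):
        for subject_dir in sorted(p for p in stage_dir.iterdir() if p.is_dir()):
            for src in sorted(p for p in subject_dir.iterdir() if p.is_file()):
                link = view_root / subject_dir.name / stage_dir.name / src.name
                link.parent.mkdir(parents=True, exist_ok=True)
                # Replace a stale link left over from a previous run.
                if link.is_symlink() or link.exists():
                    link.unlink()
                # Use a relative target so the view survives moving the project.
                os.symlink(os.path.relpath(src, start=link.parent), link)
                created.append(link)
    return created
```

Symlinks need file system support (and, on Windows, the right privileges), so this is only one way to realize the collated view.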
It seems no matter what, a long list of deps is unavoidable, since the dataset itself is constructed via many import-url commands. Depending on which of the above solutions is used, this list may need only to be in a single location.
I definitely appreciate any further thoughts you might have!
I see what you mean. This may not help in answering your practical question but that’s not exactly the case. The only assumption that DVC makes about data is that it is file-based (i.e. we don’t provide database versioning or try to understand file formats). In fact you can pull/push/import/etc. specific files inside directories tracked as a whole (we call this “target granularity”), so they’re not opaque.
I think that this is more about how files can be grouped. DVC currently only supports files grouped in directories. Hopefully soon we will support the use of wildcards in paths too, which should cover most use cases via file name patterns (please upvote those issues if you would like that).
So unfortunately, now that I think about this, I don’t think [c] will work because in order for DVC to cache files for versioning, they do need to be specified as outputs in the stage, and again, overlapping output paths are not supported. So after all, it comes back to [a]: restructure the dataset so each stage “owns” one or more directories.
But what about introducing a 3rd directory level to keep the best of both worlds? Example:
├── imports # base images in a special dir
│ ├── patient1.jpg
│ ├── patient2.jpg
│ ├── ...
├── subj1 # subject dirs are shared but sliced:
│ ├── 1-clean # owned by preprocessing stage
│ │ ├── patient1-cln.jpg # artifacts in 3rd level
│ │ ├── ...
│ ├── 2-feat # owned by featurization stage
│ │ ├── patientn-ft.jpg # artifacts in 3rd level
│ └── 3-train/
The downside is that you still need to specify multiple outs and deps per stage (and figure out all these paths inside your code) but it’s only one per subject, which I imagine is much more manageable. E.g.:
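As a hedged illustration of the one-path-per-subject idea, the sketch below builds the deps and outs each stage would list for the layout above. The stage names, script names, and the choice to make the first stage depend on the whole imports directory are hypothetical; the resulting mapping only mirrors the general shape of a dvc.yaml file and would still need to be written out with a YAML library.

```python
import json


def stage_paths(subjects, stage_dir, upstream_dir=None, imports_dir="imports"):
    """Return one out per subject, plus the deps that stage would read."""
    outs = [f"{subject}/{stage_dir}" for subject in subjects]
    if upstream_dir is None:
        # First stage: read the base images from the shared imports directory.
        deps = [imports_dir]
    else:
        # Later stages: read the directory the previous stage owns, per subject.
        deps = [f"{subject}/{upstream_dir}" for subject in subjects]
    return {"deps": deps, "outs": outs}


def build_stages(subjects):
    """Assemble a dvc.yaml-shaped mapping for two hypothetical stages."""
    return {
        "stages": {
            "preprocess": {
                "cmd": "python preprocess.py",  # hypothetical script
                **stage_paths(subjects, "1-clean"),
            },
            "featurize": {
                "cmd": "python featurize.py",  # hypothetical script
                **stage_paths(subjects, "2-feat", upstream_dir="1-clean"),
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_stages(["subj1", "subj2"]), indent=2))
```

No two stages list the same path as an out, so the overlapping-outputs restriction is respected while each subject keeps its own directory.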
I have recently started to use dvc on a relatively large project and I am very happy with it, except for the problem that is addressed in this thread. Indeed, as a design pattern for data management, I am also relying on the principle of a “directory per patient”. The principle is that (1) every “patient” directory must contain a predefined set of mandatory files; (2) over time as the project progresses, result files from additional/new analyses can be added to every directory; (3) simultaneously over time as the project progresses, new patients can be added, starting with the mandatory files. Scripts written to update or populate this repository all work based on the principle of looking for the missing files and generating them if required. This permits the addition of a new analysis or a new patient at any time, without having to recompute the whole dataset. This strategy scales relatively well and its parallelisation is also relatively trivial IMHO. But to date, I have not succeeded in deploying this strategy into the stage logic currently provided by dvc. Something is missing, for example a template YAML file that would serve to generate an up-to-date dvc.yaml file in every directory. Just an idea …
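For readers unfamiliar with the “look for missing files” pattern, here is a minimal sketch of the scanning step, assuming one directory per patient under a common root. The function name and the file names in the usage note are hypothetical.

```python
from pathlib import Path


def find_missing(root: Path, expected: list) -> dict:
    """Map each patient directory name to the expected files it does not contain yet."""
    missing = {}
    for patient_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        absent = [name for name in expected if not (patient_dir / name).exists()]
        if absent:
            missing[patient_dir.name] = absent
    return missing
```

Calling find_missing(root, ["image.nii", "mask.nii"]) would then return only the patients that still need work, which is what lets a new analysis or a new patient be added without recomputing everything.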
Hmmm. I just realized maybe subjects = patients in @ekhahniii’s comments (before I thought they referred to areas of study or something like that — a small number of them would exist).
│ ├── base files ...
│ ├── 1-clean # dir/file owned by preprocessing stage
│ ├── 2-feat # owned by featurization stage
│ ... # etc.
├── subj2 ... # repeat
Right? Where n may be pretty large. Hmmm… So again I must conclude the only current workaround is to [a] restructure the dataset by introducing intermediate directories, one per stage. And optionally also [d] to add a stage that produces a (symlinked?) view of that underlying structure in the desired format (kind of hacky though).
But there may be upcoming features to solve this (will explain in another comment)…
We are about to introduce some templating for dvc.yaml in fact, as you can see already published here: https://dvc.org/doc/user-guide/dvc-files/advanced-dvc-yaml . As it is now you could use it as a workaround for this too, but you would need to generate/manage a list of outputs/deps (in a YAML “parameters file”) and load them as global vars to iterate on in a foreach multi-stage. Probably too involved, but doable. Please take a look and feel free to recommend improvements!
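The “generate a list” part could look roughly like the sketch below, which scans for subject directories and writes them to a parameters file. The key name subjects, the function name, and the stage shown in the comments are my assumptions, loosely based on the linked templating docs; please check the current syntax there before relying on it.

```python
import json
from pathlib import Path


def write_subjects_params(root: Path, params_path: Path) -> list:
    """Write the subject directory names under root as a YAML list named 'subjects'."""
    subjects = sorted(p.name for p in root.iterdir() if p.is_dir())
    # json.dumps gives a double-quoted string, which is also a valid YAML scalar,
    # so names such as "001" are not reinterpreted as numbers.
    lines = ["subjects:"] + [f"  - {json.dumps(name)}" for name in subjects]
    params_path.write_text("\n".join(lines) + "\n")
    return subjects


# A foreach stage in dvc.yaml could then iterate over that list, roughly like
# this (hypothetical stage and script names):
#
#   stages:
#     clean:
#       foreach: ${subjects}
#       do:
#         cmd: python clean.py ${item}
#         outs:
#           - ${item}/1-clean
```

Re-running the generator whenever a subject is added keeps the list in a single location, at the cost of one extra step before repro.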