Managing pipelines operating per dataset element

Np.

I see what you mean. This may not help in answering your practical question, but that’s not exactly the case. The only assumption that DVC makes about data is that it is file-based (i.e. we don’t provide database versioning or try to understand file formats). In fact, you can pull/push/import/etc. specific files inside directories tracked as a whole (we call this “target granularity”), so they’re not opaque.

I think that this is more about how files can be grouped. DVC currently only supports files grouped in directories. Hopefully soon we will support the use of wildcards in paths too, which should cover most use cases via file name patterns (please upvote those issues if you would like that).

So unfortunately, now that I think about this, I don’t think [c] will work: in order for DVC to cache files for versioning, they need to be specified as outputs in the stage, and again, overlapping output paths are not supported. So first place goes to [a] after all: restructure the dataset so each stage “owns” one or more directories.
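To make "overlapping output paths" concrete, here is a minimal sketch of a check you could run on your own list of planned outs before writing them into a stage. It is not DVC's actual validation logic; the function name `overlapping_outs` is hypothetical, and it simply treats two paths as overlapping when they are equal or when one is a parent directory of the other.

```python
from itertools import combinations
from pathlib import PurePosixPath


def overlapping_outs(outs):
    """Return pairs of paths where one equals or contains the other.

    Illustrative only: this approximates the "no overlapping outputs"
    rule, it is not how DVC itself implements the check.
    """
    paths = [PurePosixPath(p) for p in outs]
    clashes = []
    for a, b in combinations(paths, 2):
        # Overlap: same path, or one is an ancestor directory of the other.
        if a == b or a in b.parents or b in a.parents:
            clashes.append((str(a), str(b)))
    return clashes


if __name__ == "__main__":
    # A whole subject dir plus one of its subdirs would clash.
    print(overlapping_outs(["medimg/subj1", "medimg/subj1/1-clean"]))
    # Sibling directories do not clash.
    print(overlapping_outs(["medimg/subj1/1-clean", "medimg/subj1/2-feat"]))
```

Sibling directories pass the check, which is why giving each stage its own directories avoids the problem.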

But what about introducing a 3rd directory level to keep the best of both worlds? Example:

medimg
├── imports # base images in a special dir
│   ├── patient1.jpg
│   ├── patient2.jpg
│   ├── ...
├── subj1 # subject dirs are shared but sliced:
│   ├── 1-clean # owned by preprocessing stage
│   │   ├── patient1-cln.jpg # artifacts in 3rd level
│   │   ├── ...
│   ├── 2-feat  # owned by featurization stage
│   │   ├── patientn-ft.jpg # artifacts in 3rd level
│   └── 3-train/
├── subj2
│   ├── 1-clean/
│   ├── 2-feat/
│   └── 3-train/
└── ...

The downside is that you still need to specify multiple outs and deps per stage (and figure out all these paths inside your code), but it’s only one per subject, which I imagine is much more manageable. E.g.:

# dvc.yaml
stages:
  preprocess:
    cmd: ...
    deps:
    - medimg/imports
    outs:
    - medimg/subj1/1-clean
    - medimg/subj2/1-clean
    - ...
  featurize:
    cmd: ...
    deps: # preprocess outs
    - medimg/subj1/1-clean
    - medimg/subj2/1-clean
    - ...
    outs:
    - medimg/subj1/2-feat
    ...

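Since the per-subject lists above are repetitive, one option is to generate them rather than type them. The following is only a sketch under assumptions: `build_stages` is a hypothetical helper, the `cmd` values stay as the same `...` placeholders used above, and the result is a plain dict that mirrors the dvc.yaml layout. Writing it out to the file (for example with a YAML library) is left to you.

```python
def build_stages(subjects, root="medimg"):
    """Build a dict shaped like the dvc.yaml above, one out/dep per subject.

    Hypothetical helper for illustration; "..." stands in for the real
    commands, exactly as in the hand-written example.
    """
    clean = [f"{root}/{s}/1-clean" for s in subjects]
    feat = [f"{root}/{s}/2-feat" for s in subjects]
    return {
        "stages": {
            "preprocess": {
                "cmd": "...",
                "deps": [f"{root}/imports"],
                "outs": clean,
            },
            "featurize": {
                "cmd": "...",
                "deps": list(clean),  # preprocess outs
                "outs": feat,
            },
        }
    }


if __name__ == "__main__":
    import json

    # Print the structure for inspection.
    print(json.dumps(build_stages(["subj1", "subj2"]), indent=2))
```

Adding a subject then means adding one name to the list instead of editing every stage by hand.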
And yes, [d] is definitely worth trying too.