Need guidance with a use case: benchmarking models

Dear DVC community, I need guidance with setting up DVC for an uncommon use case. Most examples I could find on the Internet focus on a single final model and simple metrics. My project aims to benchmark the performance of several fine-tuned ML models (scikit-learn and PyTorch) across roughly four datasets and several specific train/test cross-validation splits. The evaluation step produces four metrics and combines them into a summary that captures some extra characteristics of each scenario (dataset - model - split). See the example below:

{
  "rmse": {
    "iqr": 0.5864169293737658,
    "max": 3.841769718270486,
    "mean": 0.8032393243628008,
    "median": 0.6945700715137781,
    "min": 0.04119513386612043,
    "std": 0.5821376148739884
  }, ...
}

My WIP project currently resides here. It covers only one dataset so far and has produced around 20 reports. I will probably need to remove/hide it later; it will be public again after publication.

  • How should I structure the project in such a case? What are good practices?
  • Every dataset is independent, and the pipelines for different datasets will share scripts only at the final stages (i.e., training and evaluation).
    • Can a pipeline be split into separate files? I noticed that DVC picks up only the dvc.yaml file.
  • How should I structure the content of so many reports/metrics so that they are more easily picked up by dvc metrics and the VS Code extension while developing the project? (See the illustrative snippet right after this list.)
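
For concreteness, the kind of declaration I mean is sketched below; it would be repeated once per scenario, and all stage names and paths are made up purely for illustration:

# one evaluation stage per (dataset, model, split) scenario; names below are hypothetical
stages:
  evaluate_model_a_split0:
    cmd: python evaluate.py --model model_a --split 0 --out reports/model_a/split0/summary.json
    deps:
      - evaluate.py
      - models/model_a
    metrics:
      # the summary JSON shown above, declared as a DVC metrics file
      - reports/model_a/split0/summary.json:
          cache: false

Multiplied over every dataset/model/split combination, this quickly adds up to the ~20 report files mentioned above.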

Thanks in advance!

The DVC pipeline file has to be named dvc.yaml, but a DVC project can have multiple pipelines as long as they are organized into separate directories, like:

.
├── .dvc
├── pipeline1
│   └── dvc.yaml
└── pipeline2
    └── dvc.yaml

If you organize your project this way, you can run a specific pipeline with

dvc exp run path/to/specific/dvc.yaml

(or dvc repro path/to/specific/dvc.yaml)
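
Each per-dataset dvc.yaml could then reference the shared training/evaluation scripts from a common location. A minimal sketch of what one of those files might look like, assuming the shared scripts live in a top-level scripts/ directory and the data in a top-level data/ directory (all stage names and paths are placeholders):

# pipeline1/dvc.yaml (illustrative); shared scripts live one level up in scripts/
stages:
  train:
    cmd: python ../scripts/train.py --data ../data/dataset1 --out models/
    deps:
      - ../scripts/train.py
      - ../data/dataset1
    outs:
      - models/
  evaluate:
    cmd: python ../scripts/evaluate.py --models models/ --out reports/summary.json
    deps:
      - ../scripts/evaluate.py
      - models/
    metrics:
      # summary metrics for this dataset's scenarios
      - reports/summary.json:
          cache: false

Files declared under metrics: this way should then be visible to dvc metrics show and to the VS Code extension.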

Thanks, @pmrowla; I will try to organize it this way.