Integrating MLflow and DVC for Robust Machine Learning Lifecycle Management

daniel.castillo99 · October 6, 2025, 6:33pm

Hello community,

I would like to open a discussion on comprehensive management of experiments, models, and data in real-world machine learning projects using DVC and MLflow. Both systems offer features that sometimes overlap but also have complementary strengths.

Background

MLflow excels at experiment tracking: it records runs and stores their parameters, metrics, artifacts, and models in an organized structure (e.g., the mlruns folder, where each run gets its own directory holding a meta.yaml file plus params, metrics, tags, and artifacts subfolders). Its UI makes it easy to compare runs and manage the model lifecycle.
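DVC focuses on versioning large datasets and models at the storage level. It integrates with Git via .dvc files: these small pointer files are committed to Git, while the data itself lives in the DVC cache and remote storage. It also supports reproducible pipelines, defined in dvc.yaml, to automate workflows.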

Questions and Practical Approach in Real Projects

In a real project, would it be better to use MLflow and DVC together or rely on only one? What I’ve found:

MLflow’s mlruns folder holds rich metadata about experiment executions but is not a robust version control system for large datasets or heavy models.
DVC shines at versioning and synchronizing datasets and models through .dvc files, which keeps large files out of Git history while still tying each version to a commit.

Therefore, combining them allows:

DVC to manage versioning and synchronized storage of heavy data and models, ensuring reproducibility at the data level.
MLflow to handle detailed experiment tracking, logging, and comparison of metrics, parameters, and artifacts with a rich visual interface.
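
One way this combination could be wired up, as a minimal sketch rather than an established integration: read the content hash that DVC records in a .dvc pointer file and attach it, together with the current Git commit, to the MLflow run as tags. The helper names (read_dvc_hashes, build_lineage_tags, log_run_with_lineage) and the tag keys are assumptions made up for illustration. The parser assumes the usual pointer layout, with md5: and path: entries under outs:, and skips real YAML parsing to stay dependency-free, so quoted values or paths containing spaces would need a proper YAML loader.

```python
import re
import subprocess
from pathlib import Path


def read_dvc_hashes(pointer_path):
    """Return {tracked path: md5} parsed from a .dvc pointer file.

    Assumes each entry under `outs:` has an `md5:` line followed
    (possibly after other keys) by a `path:` line.
    """
    hashes = {}
    current_md5 = None
    for line in Path(pointer_path).read_text().splitlines():
        match = re.match(r"^[\s-]*(md5|path):\s*(\S+)\s*$", line)
        if not match:
            continue
        key, value = match.groups()
        if key == "md5":
            current_md5 = value
        elif current_md5 is not None:
            hashes[value] = current_md5
            current_md5 = None
    return hashes


def current_git_commit():
    """Return the current Git commit hash, or None if it cannot be read."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def build_lineage_tags(pointer_path, git_commit=None):
    """Build tags linking a run to its data version (tag keys are hypothetical)."""
    tags = {
        f"dvc.md5.{path}": md5
        for path, md5 in read_dvc_hashes(pointer_path).items()
    }
    if git_commit:
        tags["git.commit"] = git_commit
    return tags


def log_run_with_lineage(pointer_path, params, metrics):
    """Log one MLflow run tagged with the DVC data hash and Git commit."""
    import mlflow  # imported lazily so the helpers above work without MLflow

    with mlflow.start_run():
        mlflow.set_tags(build_lineage_tags(pointer_path, current_git_commit()))
        mlflow.log_params(params)
        mlflow.log_metrics(metrics)
```

With tags like these, a run picked out in the MLflow UI can be traced back to a Git commit, and checking out that commit followed by dvc checkout should restore the matching data. Whether tags, parameters, or a logged artifact is the better place for the hash is a design choice this sketch does not settle.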

Typical Integrated Project Structure

/ml-project
│
├── data/                 # Data versioned by DVC
│   ├── raw/
│   ├── processed/
│   └── dataset.dvc       # DVC file tracking dataset version
│
├── models/               # Models versioned by DVC
│   └── model.pkl.dvc
│
├── mlruns/               # MLflow folder with experiment records
│
├── dvc.yaml              # Reproducible pipelines defined in DVC
├── .dvc/                 # DVC configuration
├── code/                 # Source code and scripts
│
└── README.md
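
To make the layout more concrete, here is a hedged sketch of what a training script under code/ might look like when run as a pipeline stage: it reads from data/processed/, writes models/model.pkl for DVC to version, and writes its metrics to a JSON file. The input file name values.txt, the metrics.json output, the train function, and the mean-predictor "model" are all placeholders invented for this example, not part of the structure above.

```python
import json
import pickle
import statistics
from pathlib import Path


def train(project_root):
    """Fit a trivial mean predictor and write the model and metrics to disk."""
    root = Path(project_root)

    # Hypothetical input: whitespace-separated numbers in data/processed/.
    data_file = root / "data" / "processed" / "values.txt"
    values = [float(x) for x in data_file.read_text().split()]

    # Placeholder "model": always predicts the mean of the training values.
    model = {"mean": statistics.fmean(values)}
    mae = statistics.fmean([abs(v - model["mean"]) for v in values])

    # Model artifact that DVC would version.
    model_path = root / "models" / "model.pkl"
    model_path.parent.mkdir(parents=True, exist_ok=True)
    with model_path.open("wb") as f:
        pickle.dump(model, f)

    # Metrics file that could also be logged to MLflow from the same script.
    metrics = {"mae": mae}
    (root / "metrics.json").write_text(json.dumps(metrics, indent=2))
    return metrics
```

A dvc.yaml stage for a script like this would typically list the script and data/processed as dependencies and models/model.pkl as an output, so that dvc repro reruns training only when one of them changes.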

Discussion Points for the Community

What has been your experience using both together vs. just one?
What methodologies or best practices do you recommend for integrating these tools in projects of different scales?
How do you coordinate data/model versioning with experiment tracking effectively?
Have you encountered challenges in syncing, scalability, or traceability?

I look forward to your recommendations, best practices, or references to methodologies that could help refine a robust, reproducible, and collaborative ML workflow using DVC and MLflow.

Thank you in advance for sharing your insights!
