DVC and MLFlow - reproduce experiments using git commit ids

Hello! I’ve started using DVC and I love it, thank you for your work!

I have a question regarding DVC and MLFlow combination. I hope someone can help me with that.
I am using DVC to build/run pipelines and to version data and models, and MLFlow to get a nice overview of all the experiments and to visualize/store their metrics and plots.
During the training and evaluation stages, I log parameters and metrics to MLFlow (including the current git commit id, for reproducibility).
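That logging step can be sketched roughly like this (a minimal Python sketch; the tag name, the helper functions, and the lazy import are my own choices, not anything prescribed by DVC or MLflow):

```python
# Hedged sketch: record params, metrics and the current git commit id in an
# MLflow run. Tag/param names are examples, not a fixed convention.
import re
import subprocess


def git_head() -> str:
    """Return the commit id of HEAD (the commit this run will be linked to)."""
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()


def looks_like_commit(sha: str) -> bool:
    """Sanity-check that a string is a full 40-hex git commit id."""
    return re.fullmatch(r"[0-9a-f]{40}", sha) is not None


def log_run(params: dict, metrics: dict) -> None:
    import mlflow  # imported lazily so the helpers above work without MLflow

    with mlflow.start_run():
        mlflow.set_tag("git_commit", git_head())
        mlflow.log_params(params)
        mlflow.log_metrics(metrics)
```

The commit id recorded here is necessarily the commit that existed *before* `dvc repro` ran, which is exactly the commit A / commit B tension described below.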

Let’s say I want to experiment with a different learning rate.
My actions are:

  • Update the learning rate in params.yaml
  • git commit -m 'starting a new run with updated learning rate'
  • dvc repro (a new MLFlow run with metrics and the current git commit id is created)
  • Check the results. I like them and want to save the model.
  • git add 'dvc.lock'
  • git commit -m 'awesome learning rate'
  • dvc push
  • git push

So I have two commits here: commit A before the run and commit B after.

Let’s say I checked my MLFlow dashboard and I want to get the trained model of the last run.
It is linked to the commit A.
If I git checkout commit A, I won’t have the trained model in my workspace until I run dvc repro again.
And I can’t log commit B to the MLFlow run during training/evaluation, because I haven’t created commit B yet.

I can’t really see the whole picture… How do I do this properly?
Any ideas would be very helpful!

Thanks in advance


Hi @al.ponomar, that’s a great question.
DVC + MLflow (or other experiment loggers) is quite a common use case.

The first commit is not required by DVC. You can change params and/or code, run dvc repro, and then commit the code and models together if you like the result. However, the first commit is required by MLflow if you’d like to attach a proper commit to the metrics. You can mitigate the issue by reporting the params in MLflow as well, but that might explode the number of entities you report; not everyone likes this idea.

dvc checkout will get you the model. You can also set up Git hooks to do that automatically: https://dvc.org/doc/command-reference/install#installed-git-hooks

Your workflow makes sense, except for the first commit that MLflow requires; people tend to avoid doing that.

PS: We are working on improvements to the experimentation experience. Two big changes are expected in DVC 2.0 (in a month or two):

  1. DVC experiments without an explicit commit - https://github.com/iterative/dvc/wiki/Experiments. Plus model checkpoints, mostly for deep learning.
  2. Integrations with metrics loggers (it is in a closed repository for now).

Wow, what a fast and elaborate response!
Thank you a lot, very helpful.

Looking forward to DVC 2.0!

Meanwhile, I was also thinking of adding commit id B to the MLflow run after the experiment.
I can log to the right experiment using mlflow_run_id.
The file with mlflow_run_id is an output of the training pipeline stage and is tracked by DVC.
So after dvc repro is done and I have committed my results, I can run a short script that adds the current commit id (commit B) to the right MLflow run (mlflow_run_id).
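A sketch of such a post-hoc script might look like the following (the run_id.txt file name and the git_commit_b tag name are illustrative assumptions; in the setup described above, the run-id file would be the DVC-tracked output of the training stage):

```python
# Hypothetical post-hoc script, run after committing dvc.lock (commit B):
# attach the new commit id to the MLflow run created during training.
import subprocess
from pathlib import Path


def read_run_id(path: str) -> str:
    """Read the MLflow run id written out by the training stage."""
    return Path(path).read_text().strip()


def current_commit() -> str:
    """Commit B: the HEAD commit after committing dvc.lock."""
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()


def tag_run(run_id_file: str = "run_id.txt") -> None:
    from mlflow.tracking import MlflowClient  # lazy: only needed when tagging

    MlflowClient().set_tag(read_run_id(run_id_file), "git_commit_b", current_commit())
```

With the tag in place, the MLflow dashboard entry points at a commit that actually contains the dvc.lock needed for dvc checkout to restore the model.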


:slight_smile: We will be doing a pre-release pretty soon. I can include you in the beta users program if you are up for it. I’ll just need a contact: an email or your name in our Discord chat.

Good idea. It might work well.


Amazing! I’ll let you know when it’s ready for beta testing.


Integrations with metrics loggers sounds very useful, especially if it would enable a way to plug in any ML logging/tracking system such as mlflow, guildai, aim, wandb, mlmd, etc. Is this code going to be moved to a git branch in the DVC open source tree, or will it remain closed source? It is nice to have the bits in an experimental branch while the feature is being developed.

Hi :slight_smile: The logger project will likely live in a separate git repo (we are not 100% sure yet), but it will be open source, same as DVC.

Glad to hear that. The open source “by the community, for the community” aspect is one of the things that I find so exciting about the way DVC is evolving :slight_smile:
Integration with metrics loggers would make it so much easier to enable DVC naturally in those settings. Looking forward to it, and curious to learn more about the design, which loggers are covered, etc. I can see why a separate repo might make sense if not all DVC users need it; I guess there may be some dependencies to sort out.


Is this available now and open for contributions? From a quick look at the DVC 2.0 release plan I don’t see this project mentioned, so I am guessing this is still in another repository.

Hi @suparna, dvclive is still not released, though it shouldn’t be much longer before we release it. DVC 2.0 will support it. It is not mentioned in the release plan because it will be a standalone project, though DVC will have proper integration with it.

@al.ponomar @suparna @suparna.bhattacharya we have released the logger.
The prerelease version is available under:

Stay tuned for the full release, which is planned to go out together with the DVC 2.0 release.

Thanks! Is the code available on GitHub? How does one integrate this with MLflow and other tools that have their own logging APIs and UIs for analysis and visualization? Or let’s say we are using Kubeflow with MLMD?

@suparna

Is the code available on github

Not yet, though we will probably release it soon.

So, dvclive is intended to replace MLflow rather than integrate with it. It’s a metric logger producing outputs understandable by DVC, so the metric logs produced by dvclive can be used, for example, in the dvc plots command.

Currently, neither DVC nor dvclive integrates with MLflow or MLMD.
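To illustrate the idea (this is not dvclive's actual API, just a sketch of the concept: step-wise metric files written in TSV, a tabular format that `dvc plots` can render):

```python
# Sketch of a step-wise metric logger: each metric gets its own TSV file,
# with one (step, value) row appended per training step.
import csv
from pathlib import Path


def log_metric(logdir: str, name: str, step: int, value: float) -> None:
    """Append one step/value row to <logdir>/<name>.tsv, writing a header first."""
    path = Path(logdir) / f"{name}.tsv"
    path.parent.mkdir(parents=True, exist_ok=True)
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        if is_new:
            writer.writerow(["step", name])
        writer.writerow([step, value])
```

Because the output is just a versioned file in the repo, it can be tracked by DVC and plotted without any tracking server running, which is the design difference discussed below.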

Hmm… I am afraid that wasn’t what I was expecting. There are way too many open source metrics logging tools/APIs in use as it is (mlflow, guildai, mlmd, aim, not to mention proprietary tools like wandb). I was looking forward to a design that would integrate with these ecosystems, not more parallel universes to choose from.

@suparna I understand your concerns. The thing is that the tools you mentioned (MLflow, wandb) aim, in a way, to solve a similar problem to DVC: versioning your ML project and training pipelines. dvclive is a library created to provide monitoring for the training loop (similar to mlflow and wandb), while not requiring the user to run some kind of server on the training machine (contrary to both of them). Integrating DVC with wandb or mlflow would be hard, if not impossible, due to the different approach those tools take to versioning: DVC is a CLI tool, while wandb and mlflow are server apps.

What is your use case? How would you see the integration between DVC and, for example, wandb?