DVC and Hydra integration


I want to integrate DVC pipelines and parameter/metric tracking with Hydra logging and configs. Are there any best practices I could follow?

Some questions:

  • How should I configure the Hydra outputs dir so that logs and artifacts from the different pipeline stages are saved properly?
  • How do I connect DVC's params.yaml with the Hydra conf?

Hi @yisaienkov !

There is no out-of-the-box integration between DVC Pipelines and Hydra yet. We are currently exploring what improvements could be made in that regard. Don’t hesitate to share your opinions or requests in that discussion or in new issues.

Which best practices to recommend right now depends on how you are using Hydra within the stages, so any additional info about your pipeline would be helpful. Regardless, here are some thoughts, assuming a basic Hydra app like the one used in the tutorial:

Regarding “hydra outputs”, I would override hydra.output_subdir to None (hydra.output_subdir=null on the command line) and use DVC outputs as you usually would without Hydra. You don’t really need the date-based subfolder versioning that Hydra provides, as DVC+Git will do the proper versioning for you.
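As a rough illustration of the overrides involved, here is a minimal, stdlib-only Python sketch that assembles the command line for such a stage. The helper names (hydra_overrides, build_command) are hypothetical, and pinning hydra.run.dir to a fixed folder is my own assumption about how you might keep everything in a directory DVC can track; it is not something the app requires.

```python
from __future__ import annotations

import shlex


def hydra_overrides(output_dir: str) -> list[str]:
    """Return Hydra CLI overrides that keep outputs in one fixed directory."""
    return [
        # Do not write the .hydra/ config snapshot subdirectory.
        "hydra.output_subdir=null",
        # Assumption: use a fixed run dir instead of outputs/<date>/<time>.
        f"hydra.run.dir={output_dir}",
    ]


def build_command(script: str, output_dir: str) -> str:
    """Join the script name and overrides into a shell-safe command string."""
    return shlex.join(["python", script, *hydra_overrides(output_dir)])


print(build_command("my_app.py", "my_app_output"))
# prints: python my_app.py hydra.output_subdir=null hydra.run.dir=my_app_output
```

The same two overrides could equally be placed in the Hydra config itself rather than on the command line.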

Regarding “hydra config”, I would suggest tracking the parent config directory as a DVC dependency.

Putting it all together for this sample application, the dvc.yaml would be:

    stages:
      my_app:  # stage name is illustrative
        cmd: python my_app.py ${hydra_args}
        deps:
          - conf
        outs:
          - my_app_output

And params.yaml:

    hydra_args: "+db=postgresql db.timeout=20"

You would then run this with dvc repro, and modify params.yaml whenever you want to pass other args to Hydra.
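To make the interpolation concrete, here is a small Python sketch that mimics what happens to the cmd string. It uses string.Template only as a stand-in for DVC's templating (it is not DVC's actual implementation), and render_cmd is a hypothetical helper name.

```python
import shlex
from string import Template

# The cmd from dvc.yaml and the value from params.yaml, copied by hand.
cmd_template = Template("python my_app.py ${hydra_args}")
params = {"hydra_args": "+db=postgresql db.timeout=20"}


def render_cmd(template: Template, values: dict) -> str:
    """Substitute ${...} placeholders, roughly as DVC does before running cmd."""
    return template.substitute(values)


cmd = render_cmd(cmd_template, params)
print(cmd)
# prints: python my_app.py +db=postgresql db.timeout=20

# The shell then splits the string, so Hydra sees two separate overrides.
print(shlex.split(cmd)[2:])
# prints: ['+db=postgresql', 'db.timeout=20']
```

Because the whole override list lives in a single params.yaml entry, changing any override changes that parameter, which is what lets dvc repro notice the stage needs to run again.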

As I said at the beginning, these are just some workarounds for a very simple app. If you are willing to share more details, we can discuss your use case and see what else can be done.

Thank you @daavoo !
It’s a good example and explanation!

I will dig deeper into this topic, and if I have more questions I will ask :upside_down_face: