Tracking and resuming DVC experiments with checkpoints

Hello!

I am trying to set up an experiment with DVC that tracks checkpoints and can resume from them. I am struggling a bit with the documentation, as there seems to be little current material on the topic. I am not sure whether, with DVC 3.0, it is necessary to explicitly mark the output folder in dvc.yaml with checkpoint: true (I assume so).

So configuring dvc.yaml as:

stages:
  train:
    cmd: python train.py
    deps:
      - data/
      - train.py
    params:
      - train
    outs:
      - model_weights/:
          checkpoint: true

And then running dvc exp run should pick up the model_weights folder with the latest checkpoints. Is that correct? Or is there another way that is preferred with DVC?

Thanks.

Hey @michalb! Unfortunately, checkpoints are no longer supported as of DVC 3.0.

You still may work with checkpoints, but you need to implement the logic of saving/resuming in Python code. Here is an approach to it: GitHub - mnrozhkov/checkpoints-gcp
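To illustrate the save/resume logic in general terms, here is a minimal sketch using only the Python standard library. It is not taken from the linked repository. The file name checkpoint.json, the helper names, and the placeholder "training step" are all assumptions made for illustration; a real train.py would save framework-specific model and optimizer state instead.

```python
import json
from pathlib import Path

# Assumed layout: checkpoints live in the stage's output folder.
CKPT_FILE = Path("model_weights") / "checkpoint.json"  # hypothetical file name


def load_checkpoint(path=CKPT_FILE):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text())
    return {"epoch": 0, "weights": 0.0}


def save_checkpoint(state, path=CKPT_FILE):
    """Write to a temporary file, then rename, so a crash never leaves a partial file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(path)


def train(total_epochs, path=CKPT_FILE):
    """Resume from the last saved epoch and checkpoint after every epoch."""
    state = load_checkpoint(path)
    for epoch in range(state["epoch"], total_epochs):
        state["weights"] += 0.1  # placeholder for a real training step
        state["epoch"] = epoch + 1
        save_checkpoint(state, path)
    return state


if __name__ == "__main__":
    print(train(total_epochs=5))
```

If the run is interrupted and started again, train continues from the last saved epoch rather than from zero. One thing to check on the DVC side: as far as I know, DVC removes a stage's outputs before re-running it unless the output is marked persist: true in dvc.yaml, so resuming from model_weights/ will likely need that flag.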

Please take a look and let us know if it works for you. Looking forward to hearing from you! :slight_smile:

Mikhail
