Tracking and resuming DVC experiments with checkpoints

michalb · March 18, 2024, 11:08am

Hello!

I am trying to set up an experiment with DVC that tracks checkpoints and can resume from them. I am struggling a bit with the documentation, as there seems to be little up-to-date material on the topic. I am not sure whether, with DVC 3.0, it is necessary to explicitly mark the output folder in dvc.yaml with checkpoint: true (I assume so).

So configuring dvc.yaml as:

stages:
  train:
    cmd: python train.py
    deps:
      - data/
      - train.py
    params:
      - train
    outs:
      - model_weights/:
          checkpoint: true

And then running dvc exp run should pick up the model_weights folder with the latest checkpoints. Is that correct? Or is there another way that is preferred with DVC?

Thanks.

mikhail · March 18, 2024, 1:30pm

Hey @michalb! Unfortunately, checkpoints are no longer supported as of DVC 3.0.

You can still work with checkpoints, but you need to implement the saving and resuming logic yourself in your Python code. Here is one approach: GitHub - mnrozhkov/checkpoints-gcp
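To give a rough idea of what "saving/resuming in Python code" can mean, here is a minimal sketch using only the standard library. It is not taken from the linked repository. The file name checkpoint.json, the state layout, and the function names load_checkpoint, save_checkpoint and train are all hypothetical, and the "training step" is a placeholder.

```python
import json
import os
from pathlib import Path

# Hypothetical location: the model_weights/ folder from the question,
# with an assumed file name for the saved state.
CKPT_FILE = Path("model_weights") / "checkpoint.json"


def load_checkpoint(path=CKPT_FILE):
    """Return the saved training state, or a fresh state if none exists."""
    path = Path(path)
    if path.exists():
        with path.open() as f:
            return json.load(f)
    return {"epoch": 0, "weights": [0.0]}


def save_checkpoint(state, path=CKPT_FILE):
    """Write to a temp file, then rename, so an interrupted run
    does not leave a half-written checkpoint behind."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".tmp")
    with tmp.open("w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def train(total_epochs, path=CKPT_FILE):
    """Run up to total_epochs, continuing from the last saved epoch."""
    state = load_checkpoint(path)
    for epoch in range(state["epoch"], total_epochs):
        # Placeholder for a real training step.
        state["weights"] = [w + 1.0 for w in state["weights"]]
        state["epoch"] = epoch + 1
        save_checkpoint(state, path)
    return state


if __name__ == "__main__":
    print(train(total_epochs=5))
```

Running the script a second time with a larger total_epochs continues from the stored epoch instead of starting over. One thing to check on the DVC side: by default DVC removes a stage's outputs before rerunning it, so resuming from model_weights/ would probably also need that output marked with persist: true in dvc.yaml. Please verify this against the DVC version you use.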

Please take a look and let us know if it works for you. Looking forward to hearing from you!

Mikhail
