Experiments with checkpoint

Hello,

So currently, I have a pipeline with a stage “traindetector” which produces a number of checkpoint files (one for each epoch).

After running

dvc exp run

once, my experiment is done.
However, if I do:

dvc status traindetector

traindetector:
    always changed

The “traindetector” stage is marked as “always changed”, and if I redo:

dvc exp run

with no modification to the project, the “traindetector” stage will be redone.

Is this normal?
Could somebody help me better understand experiments with checkpoints?


Great question @kwon-young !

Yes, it’s normal that a stage with checkpoint outputs is considered always_changed (by dvc status and dvc exp run).

If you think about it, a checkpoint output is in effect a circular dependency, since you can run the experiment again and it should continue from the last version of the output. DVC considers the stage always_changed so that you can do that (just exp run without changing anything will execute the stage anyway). Think of it as a “never ending” stage.
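For reference, a stage declares a checkpoint output in dvc.yaml with `checkpoint: true` under the output. A minimal sketch (the stage, script, and file names here are placeholders for your project):

```yaml
stages:
  traindetector:
    cmd: python train.py
    deps:
      - train.py
    outs:
      - model.pt:
          checkpoint: true
```

The `checkpoint: true` flag is what makes DVC treat the stage as always changed: model.pt is both an output and, on resume, an implicit input.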

Please check out https://dvc.org/doc/user-guide/experiment-management/checkpoints if you haven’t and let us know what specific questions you have.

Can you explain what you’re trying to achieve? Then maybe we can determine whether checkpoint experiments are the right tool.

Thanks!

Thank you very much for your reply.

Isn’t this problematic if you add stages that depend on artifacts of the training?
For example, I need to integrate my trained detector into another model, which will also be trained.
If I then launch dvc exp run, will my detector be retrained?

Also, in https://github.com/iterative/dvc-checkpoints-mnist, the training consists of 10 epochs, but if you stop the training at epoch 5 and restart, wouldn’t the model train for 15 epochs in total?


I’m going to answer in reverse order because I think it will make more sense.

Yes, that’s what will happen. In that example, 10 epochs is intended to be a starting point rather than a set number of epochs. In many cases, a checkpoints script might be an infinite loop, and the user will determine when to stop training based on the trajectory of the metrics. The checkpoints feature is intended for circumstances where you don’t know how long you want to train and want to keep adding iterations (and possibly go back if performance starts getting worse).

If you know the number of epochs you want, I would not suggest using checkpoints. If you want to log the metrics from each epoch, you might want to look at dvclive.

If you are using checkpoints and want to use the model in other DVC stages, you can use dvc exp run --downstream other_model_stage to run your pipeline starting from other_model_stage so that the checkpoints model and upstream stages are not executed. Would that work for you?
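To illustrate, the downstream stage could declare the checkpoint model as a regular dependency in dvc.yaml (the stage, script, and file names below are hypothetical, not from your project):

```yaml
stages:
  train_other_model:
    cmd: python train_other.py
    deps:
      - train_other.py
      - model.pt        # checkpoint output of the detector training stage
    outs:
      - other_model.pt
```

With something like that in place, dvc exp run --downstream train_other_model would start execution at that stage without re-running the checkpoint stage above it.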

Keep in mind that checkpoints are still experimental, and the initial focus was on experimenting with the checkpoints stage itself more than on using checkpoint models in downstream stages. If you want to share more details on your workflow, it could help guide future development on how checkpoints fit into a larger DVC pipeline.


Thank you very much for your explanation.

My workflow is the following:

  1. import the dataset from a data registry and other data from a URL
  2. preprocess the data from the URL, with output: preprocessing.data
  3. train the detector using the dataset and preprocessing.data
  • run multiple experiments by varying parameters
  • train each experiment for 25 epochs and checkpoint using dvclive
  4. manually select the best-performing detector
  5. train another model using the dataset, preprocessing.data, and the best-performing detector
  • run multiple experiments by varying parameters
  • train each experiment for n epochs
  6. manually select the best model

I suppose this can work for me.

I understand this better now, thank you. Another use case for checkpoints is being able to restart training after unexpected failures, which is my main use case.

Can you explain this in more detail? Are you restarting from scratch or from a specific epoch? For example, your experiment fails after 12 epochs and you want to debug and then run epochs 13-25?

Yes, that’s the idea.
In my institution, we have a supercomputing grid on which we can run special “besteffort” jobs, which can be stopped at any time but can be run on more computing nodes.


Ah, so your jobs may be cancelled before they run to completion. In that case, instead of iterating over a constant range of steps/epochs like in dvc-checkpoints-mnist, you want to track the step number by reading in dvclive.json (or whatever path you are using for dvclive) and iterate until that step number is reached, right? That should also allow you to dvc exp run without having to manually run the downstream stages separately.
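As a sketch of that idea (assuming dvclive writes its summary to dvclive.json with a `step` field recording the last completed step, and using a hypothetical 25-epoch target):

```python
import json
import os

TOTAL_EPOCHS = 25  # hypothetical target for the whole run


def completed_epochs(summary_path="dvclive.json"):
    """Return how many epochs have already run, based on dvclive's summary file."""
    if not os.path.exists(summary_path):
        return 0  # fresh run, nothing logged yet
    with open(summary_path) as f:
        summary = json.load(f)
    # dvclive records the last finished step; the next epoch to run is step + 1
    return summary.get("step", -1) + 1


start = completed_epochs()
if start < TOTAL_EPOCHS:
    # only now pay for the costly setup of loading data and models
    for epoch in range(start, TOTAL_EPOCHS):
        pass  # train one epoch, log metrics via dvclive, save a checkpoint
```

If the job was cancelled after epoch 12, the next run would pick up at epoch 13 and stop once 25 epochs are reached, so repeated dvc exp run invocations converge instead of training forever.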

Yes, that is a good idea. I would just have to add a check at the beginning of my script to see if there are more epochs to be done and, if not, quit before doing the long and costly setup of loading data and models.
