Experiments with checkpoint

Hello,

So currently, I have a pipeline with a stage “traindetector” which produces a number of checkpoint files (one for each epoch).

After running

dvc exp run

once, my experiment is done.
However, if I do:

dvc status traindetector

traindetector:
    always changed

The “traindetector” stage is marked as “always changed”, and if I redo:

dvc exp run

with no modification to the project, the “traindetector” stage will be redone.

Is this normal?
Could somebody help me better understand experiments with checkpoints?


Great question @kwon-young !

Yes, it’s normal that a stage with checkpoint outputs is considered always_changed (by dvc status and dvc exp run).

If you think about it, a checkpoint output is in effect a circular dependency, since you can run the experiment again and it should continue from the last version of the output. DVC considers the stage always_changed so that you can do that (just exp run without changing anything will execute the stage anyway). Think of it as a “never ending” stage.
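For reference, a stage declares a checkpoint output in dvc.yaml with `checkpoint: true` under the output. A minimal sketch (the stage, script, and file names here are placeholders for your project):

```yaml
stages:
  traindetector:
    cmd: python train.py
    deps:
      - train.py
    outs:
      - model.pt:
          checkpoint: true
```

The `checkpoint: true` flag is what makes DVC treat the stage as always changed: model.pt is both an output and, on resume, an implicit input.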

Please check out https://dvc.org/doc/user-guide/experiment-management/checkpoints if you haven’t and let us know what specific questions you have.

Can you explain what you’re trying to achieve? Then maybe we can determine whether checkpoint experiments are the right tool.

Thanks!

Thank you very much for your reply.

Isn’t this problematic if you add stages that depend on artifacts of the training?
For example, I need to integrate my trained detector into another model, which will also be trained.
If I then launch dvc exp run, will my detector be retrained?

Also, in https://github.com/iterative/dvc-checkpoints-mnist, the training consists of 10 epochs, but if you stop the training at epoch 5 and restart, wouldn’t the model train for 15 epochs in total?


I’m going to answer in reverse order because I think it will make more sense.

Yes, that’s what will happen. In that example, 10 epochs is intended to be a starting point rather than a set number of epochs. In many cases, a checkpoints script might be an infinite loop, and the user will determine when to stop training based on the trajectory of the metrics. The checkpoints feature is intended for circumstances where you don’t know how long you want to train and want to keep adding iterations (and possibly go back if performance starts getting worse).

If you know the number of epochs you want, I would not suggest using checkpoints. If you want to log the metrics from each epoch, you might want to look at dvclive.

If you are using checkpoints and want to use the model in other DVC stages, you can use dvc exp run --downstream other_model_stage to run your pipeline starting from other_model_stage so that the checkpoints model and upstream stages are not executed. Would that work for you?
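To illustrate, the downstream stage could declare the checkpoint model as a regular dependency in dvc.yaml (the stage, script, and file names below are hypothetical, not from your project):

```yaml
stages:
  train_other_model:
    cmd: python train_other.py
    deps:
      - train_other.py
      - model.pt        # checkpoint output of the detector training stage
    outs:
      - other_model.pt
```

With something like that in place, dvc exp run --downstream train_other_model would start execution at that stage without re-running the checkpoint stage above it.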

Keep in mind that checkpoints are still experimental, and the initial focus was on experimenting with the checkpoints stage itself more than on using checkpoint models in downstream stages. If you want to share more details on your workflow, it could help guide future development on how checkpoints fit into a larger DVC pipeline.


Thank you very much for your explanation.

My workflow is the following:

  1. import the dataset from a data registry and other data from a URL
  2. preprocess the data from the URL, with output: preprocessing.data
  3. train the detector using the dataset and preprocessing.data
  • run multiple experiments by varying parameters
  • train each experiment for 25 epochs and checkpoint using dvclive
  4. manually select the best-performing detector
  5. train another model using the dataset, preprocessing.data, and the best-performing detector
  • run multiple experiments by varying parameters
  • train each experiment for n epochs
  6. manually select the best model

I suppose this can work for me.

I understand this better now, thank you. Another use case for checkpoints is being able to restart training after unexpected failures, which is my main use case.

Can you explain this in more detail? Are you restarting from scratch or from a specific epoch? For example, your experiment fails after 12 epochs and you want to debug and then run epochs 13-25?

Yes, that’s the idea.
In my institution, we have a supercomputing grid on which we can run special “besteffort” jobs, which can be stopped at any time but can be run on more computing nodes.


Ah, so your jobs may be cancelled before they run to completion. In that case, instead of iterating over a constant range of steps/epochs like in dvc-checkpoints-mnist, you want to track the step number by reading in dvclive.json (or whatever path you are using for dvclive) and iterate until that step number is reached, right? That should also allow you to dvc exp run without having to manually run the downstream stages separately.
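As a sketch of that idea (assuming dvclive writes its summary to dvclive.json with a `step` field recording the last completed step, and using a hypothetical 25-epoch target):

```python
import json
import os

TOTAL_EPOCHS = 25  # hypothetical target for the whole run


def completed_epochs(summary_path="dvclive.json"):
    """Return how many epochs have already run, based on dvclive's summary file."""
    if not os.path.exists(summary_path):
        return 0  # fresh run, nothing logged yet
    with open(summary_path) as f:
        summary = json.load(f)
    # dvclive records the last finished step; the next epoch to run is step + 1
    return summary.get("step", -1) + 1


start = completed_epochs()
if start < TOTAL_EPOCHS:
    # only now pay for the costly setup of loading data and models
    for epoch in range(start, TOTAL_EPOCHS):
        pass  # train one epoch, log metrics via dvclive, save a checkpoint
```

If the job was cancelled after epoch 12, the next run would pick up at epoch 13 and stop once 25 epochs are reached, so repeated dvc exp run invocations converge instead of training forever.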

Yes, that is a good idea. I would just have to add a check at the beginning of my script to see if there are more epochs to be done and, if not, quit before doing the long and costly setup of loading data and models.
