Yes, it’s normal that a stage with checkpoint outputs is considered always_changed (by dvc status and dvc exp run).
If you think about it, a checkpoint output is in effect a circular dependency, since you can run the experiment again and it should continue from the last version of the output. DVC considers the stage always_changed so that you can do that (just exp run without changing anything will execute the stage anyway). Think of it as a “never ending” stage.
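For context, a checkpoint output is declared in dvc.yaml roughly like this (stage and file names here are made up for illustration):

```yaml
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
    outs:
      - model.pt:
          checkpoint: true   # marks this output as a checkpoint
```

Because `model.pt` is both produced by the stage and read back in on the next run, DVC treats the stage as always_changed.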
Isn’t this problematic if you are adding stages that depend on artifacts of the training?
For example, I need to integrate my trained detector in another model which will also be trained.
If I then launch dvc exp run, will my detector be retrained?
I’m going to answer in reverse order because I think it will make more sense.
Yes, that’s what will happen. In that example, 10 epochs is intended to be a starting point rather than a set number of epochs. In many cases, a checkpoints script might be an infinite loop, and the user will determine when to stop training based on the trajectory of the metrics. The checkpoints feature is intended for circumstances where you don’t know how long you want to train and want to keep adding iterations (and possibly go back if performance starts getting worse).
If you know the number of epochs you want, I would not suggest using checkpoints. If you want to log the metrics from each epoch, you might want to look at dvclive.
If you are using checkpoints and want to use the model in other DVC stages, you can use dvc exp run --downstream other_model_stage to run your pipeline starting from other_model_stage so that the checkpoints model and upstream stages are not executed. Would that work for you?
Keep in mind that checkpoints are still experimental, and the initial focus was on experimenting with the checkpoints stage more than using checkpoint models in downstream stages. If you want to share more details on your workflow, it could help guide future development on how checkpoints fit into a larger DVC pipeline.
Ah, so your jobs may be cancelled before they run to completion. In that case, instead of iterating over a constant range of steps/epochs like in dvc-checkpoints-mnist, you want to track the step number by reading in dvclive.json (or whatever path you are using for dvclive) and iterate until that step number is reached, right? That should also allow you to dvc exp run without having to manually run the downstream stages separately.
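A rough sketch of that resume logic, assuming the last logged step is available under a `"step"` key in dvclive.json (the file path, key name, and `TOTAL_STEPS` target are illustrative assumptions, not a guaranteed dvclive schema):

```python
import json
import os

TOTAL_STEPS = 100  # hypothetical total step budget for the full run

def last_completed_step(path="dvclive.json"):
    """Return the last step recorded in the dvclive log, or -1 if no log exists yet."""
    if not os.path.exists(path):
        return -1
    with open(path) as f:
        metrics = json.load(f)
    return metrics.get("step", -1)

# Resume from wherever the previous (possibly cancelled) job stopped.
start = last_completed_step() + 1
for step in range(start, TOTAL_STEPS):
    # train_one_step(step)  # placeholder for the real training code
    pass
```

If the job is cancelled partway, the next `dvc exp run` picks up from the last recorded step instead of restarting from zero.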
Yes, that is a good idea. I would just have to add a check at the beginning of my script to see if there are more epochs to be done and, if not, quit before doing the long and costly setup of loading the data and models.
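That early-exit guard could look something like this (again assuming the step is readable from dvclive.json; names are illustrative):

```python
import json
import os
import sys

TOTAL_STEPS = 100  # hypothetical training budget

def remaining_steps(path="dvclive.json", total=TOTAL_STEPS):
    """Return how many steps are left, based on the last step in the dvclive log."""
    if not os.path.exists(path):
        return total
    with open(path) as f:
        last = json.load(f).get("step", -1)
    return max(total - (last + 1), 0)

if remaining_steps() == 0:
    sys.exit(0)  # nothing left to do: skip loading data and models entirely
# ...otherwise proceed with the expensive setup and resume training
```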