Git Flow for DVC 🌿

Thanks for taking the time to give feedback!

Problem 2

What do you mean by just in the run cache? Are you using only -O options for dvc run?

No, we’re using the -o option, but maybe I’ve misunderstood how DVC works here.

My understanding is:
The CI Runner successfully executed the DVC pipeline, and now has the new outputs (trained model, processed data, …) as well as a new dvc.lock in its working directory.
To make these outputs accessible from the developer's local machine, I see two options:

a) The CI Runner runs ci-runner: dvc push --run-cache, and afterwards I run dev-local: dvc pull --run-cache && dvc repro.

  • Disadvantages:
    • AFAIK I can’t pull the run cache for just one stage (I need to pull all the preprocessed data as well as the trained model, even if I only want dvc pull --run-cache model_training)
    • Looking at a git branch, it’s not immediately obvious whether a successful dvc repro has been executed here (compared to a commit with a changed dvc.lock)
    • I couldn’t find anything in the garbage collection docs regarding how to clean the run cache, or which runs stay cached after cleaning
  • Advantages:
    • No extra commit is necessary: to check out the results of an experiment, I can pull the commit with which the experiment was started
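
Option a) could be sketched roughly like this. The function names, the run helper and the DRY_RUN variable are hypothetical and only there for illustration; the only DVC commands used are the ones mentioned above.

```shell
# Hypothetical helper: print each command, and execute it unless DRY_RUN=1.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

# On the CI runner, after the pipeline has been executed successfully.
ci_share_run_cache() {
  run dvc push --run-cache
}

# On the developer machine, checked out at the commit that started the run.
dev_restore_from_run_cache() {
  run dvc pull --run-cache
  run dvc repro
}
```

With DRY_RUN=1 dev_restore_from_run_cache the two commands are only printed, which is handy for checking the order before running them against a real remote.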

b) ci-runner: git add dvc.lock && git commit && git push && dvc push

  • Disadvantages:
    • CI-Runner needs to be configured to be able to commit to a branch
    • The “results” (the dvc.lock) are stored in a separate commit from the one which started the experiments (relevant for e.g. MLflow, which stores the commit hash of an experiment)
  • Advantages:
    • The changed dvc.lock in a branch tells me that this experiment has been successfully executed
    • I can pull single dvc stages (dvc pull model_training) without downloading all preprocessed data
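
A minimal sketch of option b), under the assumption that the CI runner already has push rights on the branch and a configured DVC remote. The function names, the run helper, DRY_RUN and the commit message are hypothetical; model_training is the stage name used above.

```shell
# Hypothetical helper: print each command, and execute it unless DRY_RUN=1.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

# On the CI runner: commit the updated dvc.lock and upload the outputs.
ci_publish_results() {
  run git add dvc.lock
  run git commit -m "Update dvc.lock after pipeline run"
  run git push
  run dvc push
}

# On the developer machine: fetch the new dvc.lock, then pull one stage only.
dev_fetch_stage() {
  run git pull
  run dvc pull "$1"
}
```

For example, dev_fetch_stage model_training would download only the outputs of that stage, without the preprocessed data.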

For all these reasons, we opted for option b).

Problem 3

Your new solution (dvc exp run {stage_name} --params epochs=3) seems ideal, thanks for the hint! Together with the parameter-like dependencies, we can streamline our project a lot.