Git Flow for DVC 🌿

rabefabi · December 10, 2020, 9:44pm

Thanks for taking the time to give feedback!

Problem 2

What do you mean by “just in the run cache”? Are you using only -O options for dvc run?

No, we’re using the -o option, but maybe I’ve misunderstood how DVC works here.

My understanding is:
The CI Runner successfully executed the DVC pipeline, and now has the new outputs (trained model, processed data, …) as well as a new dvc.lock in its working directory.
To make these outputs accessible from the developer’s local machine, I see two options:

a) The CI runner does ci-runner: dvc push --run-cache; afterwards I do dev-local: dvc pull --run-cache && dvc repro.

Disadvantages:
- AFAIK I can’t pull the run cache for just one stage (I need to pull all the preprocessed data as well as the trained model, even if I only want dvc pull --run-cache model_training)
- Looking at a git branch, it’s not immediately obvious whether a successful dvc repro has been executed here (compared to a commit with a changed dvc.lock)
- I couldn’t find anything in the garbage collection docs regarding how to clean the run cache, or which runs stay cached after cleaning
Advantages:
- No extra commit necessary: to check out the results of an experiment, I can pull the commit with which the experiment was started
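
As a rough sketch of option a), the two sides could look something like the script below. The function names, the `run` helper and the `DRY_RUN` switch are made up for illustration; only the dvc commands themselves come from the description above, and I'm assuming the runner executes the pipeline with a plain dvc repro.

```shell
#!/bin/sh
# Hypothetical helper: with DRY_RUN=1 the command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# CI runner side: execute the pipeline, then upload the run cache.
# No git commit is made, so the branch itself does not change.
ci_runner_publish() {
    run dvc repro
    run dvc push --run-cache
}

# Developer side, on the commit that started the experiment:
# download the run cache, then let dvc repro reuse the cached runs.
dev_local_restore() {
    run dvc pull --run-cache
    run dvc repro
}
```

Calling `ci_runner_publish` with `DRY_RUN=1` only prints the two commands, which is handy for checking the CI job before letting it touch the remote.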

b) ci-runner: git add dvc.lock && git commit && git push && dvc push

Disadvantages:
- The CI runner needs to be configured to be able to commit to a branch
- The “results” (the dvc.lock) are stored in a separate commit from the one which started the experiments (relevant for e.g. MLflow, which stores the commit hash of an experiment)
Advantages:
- The changed dvc.lock in a branch tells me that this experiment has been successfully executed
- I can pull single dvc stages (dvc pull model_training) without downloading all preprocessed data
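
Option b) could be sketched roughly as follows. Again, the function names, the `run` helper, `DRY_RUN` and the commit message are placeholders I made up; a non-interactive CI job needs some message for git commit, and the runner is assumed to already have push access to the branch.

```shell
#!/bin/sh
# Hypothetical helper: with DRY_RUN=1 the command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# CI runner side: record the new dvc.lock in git, then upload the outputs.
ci_runner_commit_results() {
    run git add dvc.lock
    run git commit -m "Update dvc.lock after pipeline run"  # placeholder message
    run git push
    run dvc push
}

# Developer side: fetch the new commit, then pull the outputs of one stage only.
dev_local_fetch_stage() {
    run git pull
    run dvc pull "$1"   # e.g. model_training
}
```

For example, `dev_local_fetch_stage model_training` fetches only the trained model, without the preprocessed data.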

For all these reasons, we opted for option b).

Problem 3

Your new solution (dvc exp run {stage_name} --params epochs=3) seems ideal, thanks for the hint! Together with the parameter-like dependencies, we can streamline our project a lot.
