By publishing this post, I’m following Cunningham’s Law, hoping to get feedback on how you use DVC in a team and what your opinions are on our as-yet unresolved points (or the concept in general).
Hi @rabefabi!
Thank you very much for sharing your experiences!
Your setup seems robust and well thought through.
About unresolved problems:
Problem #1.
Running some kind of CI job on a debug dataset, to verify that each pull request does not break existing code even if it’s only a configuration change, definitely seems like a reasonable idea.
Problem #2.
Also, we’re unsure how the DVC garbage collection treats data from executions which are “just in the run cache”.
What do you mean by just in the run cache? Are you using only -O options for dvc run?
We should probably have more info on gc and run-cache interaction in our docs.
Problem #3.
Currently it’s a bit cumbersome. Editing params.yaml seems like one way to switch between a test run and a “master” run. What might be interesting for you is that we are currently developing an experiments feature that might help with your goal. With the latest DVC version, you could try (for example) dvc exp run {stage_name} --params epochs=3; this would reproduce the given stage_name with epochs=3.
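As a rough sketch, the command above could be wrapped in a small shell helper. The function name quick_run and the DVC override variable are made up for illustration, and the --params flag is spelled exactly as in the command above; newer DVC releases spell it -S / --set-param, so check dvc exp run --help for your version.

```shell
# Set DVC=echo to print the command instead of running it (illustrative convention).
DVC="${DVC:-dvc}"

# quick_run STAGE KEY=VALUE
# Reproduce a single stage with a temporary parameter override.
# Flag spelling follows the post; it may differ in your DVC version.
quick_run() {
    "$DVC" exp run "$1" --params "$2"
}
```

For a short test run this would be called as quick_run model_training epochs=3, where model_training is only a placeholder stage name.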
What do you mean by just in the run cache? Are you using only -O options for dvc run?
No, we’re using the -o option, but maybe I’ve misunderstood how DVC works here.
My understanding is:
The CI Runner successfully executed the DVC pipeline, and now has the new outputs (trained model, processed data, …) as well as a new dvc.lock in its working directory.
To make these outputs accessible from the developer’s local machine, I see two options:
a) The CI Runner does ci-runner: dvc push --run-cache, afterwards I do dev-local: dvc pull --run-cache && dvc repro.
Disadvantages:
AFAIK I can’t pull the run cache for just one stage (I need to pull all the preprocessed data as well as the trained model, even if I only want dvc pull --run-cache model_training)
Looking at a git branch, it’s not immediately obvious whether a successful dvc repro has been executed here (compared to a commit with a changed dvc.lock)
I couldn’t find anything in the garbage collection docs regarding how to clean the run cache, or which runs stay cached after cleaning
Advantages:
No extra commit necessary: to check out the results of an experiment, I can pull the commit with which the experiment was started
CI-Runner needs to be configured to be able to commit to a branch
The “results” (the dvc.lock) are stored in a separate commit from the one which started the experiment (relevant for, e.g., MLflow, which stores the commit hash of an experiment)
Advantages:
The changed dvc.lock in a branch tells me that this experiment has been successfully executed
I can pull single dvc stages ( dvc pull model_training) without downloading all preprocessed data
For all these reasons we opted for option b)
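To make the comparison concrete, here is a rough sketch of both options as shell functions. The dvc commands are the ones mentioned above; the function names, the commit message, the plain dvc push in option b), and the DVC/GIT override variables are assumptions for illustration, not our exact CI setup.

```shell
# Set DVC=echo GIT=echo to print the commands instead of running them.
DVC="${DVC:-dvc}"
GIT="${GIT:-git}"

# Option a): share results only through the run cache.
ci_publish_a() {
    "$DVC" push --run-cache
}
dev_fetch_a() {
    "$DVC" pull --run-cache && "$DVC" repro
}

# Option b): the CI runner commits the updated dvc.lock to the branch.
ci_publish_b() {
    "$DVC" push &&
    "$GIT" add dvc.lock &&
    "$GIT" commit -m "Update dvc.lock after pipeline run" &&  # message is a placeholder
    "$GIT" push
}
# dev_fetch_b STAGE: get the new dvc.lock, then pull only the stage needed.
dev_fetch_b() {
    "$GIT" pull && "$DVC" pull "$1"
}
```

With option b), dev_fetch_b model_training fetches just the trained model, which is the main reason it fits our workflow better.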
Problem 3
Your new solution (dvc exp run {stage_name} --params epochs=3) seems ideal, thanks for the hint! Together with the parameter-like dependencies, we can streamline our project a lot.