Git Flow for DVC 🌿

rabefabi · December 9, 2020, 9:36am

Hey everyone,

Over the past couple of months we have started using DVC in our small team.

With a handful of developers all coding, training models & committing in the same repository, we soon realized the need for a workflow.

We came up with a concept inspired by Git Flow, which I outlined in this post:

https://git.rz.uni-augsburg.de/rabefabi/git-flow-for-dvc

By publishing this post, I’m following Cunningham’s Law, hoping to get feedback on how you use DVC in a team and what you think of our still-unresolved points (or of the concept in general).

Cheers,

Fabian

Paffciu · December 9, 2020, 10:33am

Hi @rabefabi!
Thank you very much for sharing your experiences!
Your setup seems robust and well thought through.

About unresolved problems:
Problem #1.
Running some kind of CI on a debug dataset, to verify that each pull request does not break existing code even if it is only a configuration change, definitely seems like a reasonable idea.

Problem #2.

Also, we’re unsure how the DVC garbage collection treats data from executions which are “just in the run cache”.

What do you mean by just in the run cache? Are you using only -O options for dvc run?
We should probably have more info on gc and run-cache interaction in our docs.

Problem #3.
Currently it’s a bit cumbersome. Editing params.yaml seems like a way to switch between a test run and a “master” run. What might be interesting for you is that we are currently developing an experiments feature, which might help with your goal. With the latest DVC version you could try, for example, dvc exp run {stage_name} --params epochs=3, which would reproduce the given stage_name with epochs=3.
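
A rough sketch of how such a quick test run could be wrapped in a script. The function names and the default of 3 epochs are placeholders, not something from the linked post. The --params flag is the spelling mentioned above; other DVC releases may name it differently, so check dvc exp run --help for your version. With DRY_RUN=1 the command is only printed instead of executed:

```shell
#!/bin/sh
# Sketch only: function names and defaults are hypothetical.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Reproduce one stage with a reduced number of epochs as a quick test run.
# Usage: debug_run <stage_name> [epochs]
debug_run() {
    stage="$1"
    epochs="${2:-3}"
    # Flag spelling taken from the command quoted in this thread.
    run dvc exp run "$stage" --params "epochs=$epochs"
}
```

For example, after sourcing the script, DRY_RUN=1 debug_run model_training shows the command that would be run for a stage called model_training.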

elle · December 9, 2020, 8:51pm

@rabefabi this is so cool! Do you mind if we share it on our Twitter to get more eyes on it? Really love how you are using CI for your DVC pipelines.

rabefabi · December 10, 2020, 9:44pm

Thanks for taking the time to give feedback!

Problem 2

What do you mean by just in the run cache? Are you using only -O options for dvc run?

No, we’re using the -o option, but maybe I’ve misunderstood how DVC works here.

My understanding is:
The CI Runner has successfully executed the DVC pipeline, and now has the new outputs (trained model, processed data, …) as well as a new dvc.lock in its working directory.
To make these outputs accessible from a developer’s local machine, I see two options:

a) The CI Runner does ci-runner: dvc push --run-cache; afterwards I do dev-local: dvc pull --run-cache && dvc repro.

Disadvantages:
- AFAIK I can’t pull the run cache for just one stage (I need to pull all the preprocessed data as well as the trained model, even if I only want dvc pull --run-cache model_training)
- Looking at a git branch, it’s not immediately obvious whether a successful dvc repro has been executed here (compared to a commit with a changed dvc.lock)
- I couldn’t find anything in the garbage collection docs regarding how to clean the run cache, or which runs stay cached after cleaning
Advantages:
- No extra commit necessary: to check out the results of an experiment, I can pull the commit with which the experiment was started
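
Option a) could be sketched roughly like this. The function names are made up for illustration, and the run helper only prints the commands when DRY_RUN=1 is set. That dvc repro then restores the outputs from the run cache instead of recomputing them is my understanding of DVC, not something verified here:

```shell
#!/bin/sh
# Sketch of option a): results travel through the run cache, no extra commit.
# Function names are hypothetical.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# On the CI runner, after the pipeline has finished successfully.
ci_publish_run_cache() {
    run dvc push --run-cache
}

# On the developer machine, with the commit that started the experiment
# checked out. Note that this pulls everything, not a single stage.
dev_fetch_via_run_cache() {
    run dvc pull --run-cache && run dvc repro
}
```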

b) ci-runner: git add dvc.lock && git commit && git push && dvc push

Disadvantages:
- CI-Runner needs to be configured to be able to commit to a branch
- The “results” (the dvc.lock) are stored in a separate commit from the one that started the experiment (relevant for e.g. MLflow, which stores the commit hash of an experiment)
Advantages:
- The changed dvc.lock in a branch tells me that this experiment has been successfully executed
- I can pull single DVC stages (dvc pull model_training) without downloading all preprocessed data
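
As a rough sketch, option b) looks something like this. The remote name origin, the commit message and the function names are placeholders, and the CI runner needs push rights for the branch. With DRY_RUN=1 the commands are only printed:

```shell
#!/bin/sh
# Sketch of option b): the CI runner commits the updated dvc.lock.
# Remote name, commit message and function names are hypothetical.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# On the CI runner, after the pipeline has finished successfully.
# Usage: ci_publish_commit <branch>
ci_publish_commit() {
    branch="$1"
    run git add dvc.lock &&
    run git commit -m "Update dvc.lock after pipeline run" &&
    run git push origin "$branch" &&
    run dvc push
}

# On the developer machine: get the new dvc.lock, then pull the outputs
# of a single stage only.
# Usage: dev_fetch_stage <stage_name>
dev_fetch_stage() {
    stage="$1"
    run git pull && run dvc pull "$stage"
}
```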

For all these reasons we opted for option b)

Problem 3

Your new solution (dvc exp run {stage_name} --params epochs=3) seems ideal, thanks for the hint! Together with the parameter-like dependencies we can streamline our project a lot.

rabefabi · December 10, 2020, 9:46pm

Of course you can share it, the bigger the discussion gets, the better

elle · December 11, 2020, 8:02pm

we shared on twitter, hope we get some eyes on it. we’ll put in our monthly newsletter, too. https://twitter.com/DVCorg/status/1337486256947646466

i love the idea for a debugging dataset. get all the format/input-output errors out of the way before you get to your full dataset on CI.
