Git Flow for DVC 🌿

Hey everyone,

Over the past couple of months we have started using DVC in our small team.

With a handful of developers all coding, training models & committing in the same repository, we soon realized the need for a workflow.

We came up with a concept inspired by Git Flow, which I outlined in this post:

https://git.rz.uni-augsburg.de/rabefabi/git-flow-for-dvc

By publishing this post, I’m following Cunningham’s Law, hoping to get feedback on how you use DVC in a team and to hear your opinions on our still unresolved points (or on the concept in general).

Cheers,

Fabian


Hi @rabefabi!
Thank you very much for sharing your experiences!
Your setup seems to me to be robust and thought through.

About unresolved problems:
Problem #1.
Running some kind of CI job on a debug dataset, to verify that each pull request does not break existing code even if it is only a configuration change, definitely seems like a reasonable idea.

Problem #2.

Also, we’re unsure how the DVC garbage collection treats data from executions which are “just in the run cache”.

What do you mean by just in the run cache? Are you using only -O options for dvc run?
We should probably have more info on gc and run-cache interaction in our docs.

Problem #3.
Currently it’s a bit cumbersome. Editing params.yaml seems like a way to switch between a test run and a “master” run. What might be interesting for you is that we are currently developing an experiments feature, which might help you with your goal. With the latest DVC version, you could try out (for example) dvc exp run {stage_name} --params epochs=3; this way you would reproduce the given stage_name with epochs=3.
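
As a rough sketch of the command above, wrapped in a small helper: the stage name model_training is borrowed from later in this thread, epochs is assumed to be a key in params.yaml, and DVC_BIN is a hypothetical variable added only so the command can be previewed without running it. The experiments feature was still experimental at this point, and as far as I know later DVC releases renamed --params to -S/--set-param, so check dvc exp run --help for your version.

```shell
# Hypothetical helper: reproduce one stage as an experiment with a parameter override.
# DVC_BIN defaults to the real dvc; set DVC_BIN="echo dvc" to only print the command.
exp_run_with_param() {
  stage="$1"      # stage name, e.g. model_training (assumed)
  override="$2"   # override, e.g. epochs=3 (assumed to exist in params.yaml)
  ${DVC_BIN:-dvc} exp run "$stage" --params "$override"
}
```

Calling exp_run_with_param model_training epochs=3 would then run that stage with epochs=3 without editing params.yaml by hand.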

@rabefabi this is so cool! Do you mind if we share it on our Twitter to get more eyes on it? Really love how you are using CI for your DVC pipelines.

Thanks for taking the time to give feedback!

Problem 2

What do you mean by just in the run cache? Are you using only -O options for dvc run?

No, we’re using the -o option, but maybe I’ve misunderstood how DVC works here.

My understanding is:
The CI Runner successfully executed the DVC pipeline, and now has the new outputs (trained model, processed data, …) as well as a new dvc.lock in its working directory.
To make these outputs accessible from the developer’s local machine, I see two options:

a) The CI Runner does ci-runner: dvc push --run-cache, afterwards I do dev-local: dvc pull --run-cache && dvc repro.

  • Disadvantages:
    • AFAIK I can’t pull the run cache for just one stage (I need to pull all the preprocessed data as well as the trained model, even if I only want dvc pull --run-cache model_training)
    • Looking at a git branch, it’s not immediately obvious whether a successful dvc repro has been executed here (compared to a commit with a changed dvc.lock)
    • I couldn’t find anything in the garbage collection docs regarding how to clean the run cache, or which runs stay cached after cleaning
  • Advantages:
    • No extra commit necessary: to check out the results of an experiment, I can pull the commit with which the experiment was started
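
To make option a) concrete, here is a minimal sketch of the two sides as shell helpers. The function names and the DVC_BIN preview variable are made up for illustration; only the dvc commands themselves come from the description above.

```shell
# Option a) as two hypothetical helpers.
# Set DVC_BIN="echo dvc" to print the commands instead of running them.

# On the CI runner, after a successful pipeline run:
ci_publish_run_cache() {
  ${DVC_BIN:-dvc} push --run-cache
}

# On the developer machine, checked out at the commit that started the experiment:
dev_restore_from_run_cache() {
  ${DVC_BIN:-dvc} pull --run-cache && ${DVC_BIN:-dvc} repro
}
```

The dvc repro on the developer side is expected to restore the outputs from the pulled run cache rather than re-execute the stages, provided the dependencies and commands match what the runner executed.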

b) ci-runner: git add dvc.lock && git commit && git push && dvc push

  • Disadvantages:
    • CI-Runner needs to be configured to be able to commit to a branch
    • The “results” (the dvc.lock) are stored in a separate commit from the one which started the experiments (relevant for e.g. MLflow, which stores the commit hash of an experiment)
  • Advantages:
    • The changed dvc.lock in a branch tells me that this experiment has been successfully executed
    • I can pull single dvc stages (dvc pull model_training) without downloading all preprocessed data

For all these reasons we opted for option b).
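
Roughly, option b) looks like the sketch below. The helper names, the default commit message, and the GIT_BIN/DVC_BIN preview variables are assumptions for illustration; the -m flag is added only so the commit works non-interactively on a runner, and the runner still needs push rights to the branch.

```shell
# Option b) as two hypothetical helpers.
# Set GIT_BIN="echo git" and DVC_BIN="echo dvc" to print the commands instead of running them.

# On the CI runner, after a successful pipeline run:
ci_commit_results() {
  message="${1:-Update dvc.lock after pipeline run}"   # commit message (assumed default)
  ${GIT_BIN:-git} add dvc.lock &&
    ${GIT_BIN:-git} commit -m "$message" &&
    ${GIT_BIN:-git} push &&
    ${DVC_BIN:-dvc} push
}

# On the developer machine: get the new dvc.lock, then pull only one stage's outputs.
dev_fetch_stage() {
  stage="$1"   # e.g. model_training
  ${GIT_BIN:-git} pull && ${DVC_BIN:-dvc} pull "$stage"
}
```

With this, dev_fetch_stage model_training downloads the trained model without fetching all of the preprocessed data.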

Problem 3

Your new solution (dvc exp run {stage_name} --params epochs=3) seems ideal, thanks for the hint! Together with the parameter-like dependencies we can streamline our project a lot.

Of course you can share it, the bigger the discussion gets, the better 🙂

We shared it on Twitter, hope we get some eyes on it. We’ll put it in our monthly newsletter, too. https://twitter.com/DVCorg/status/1337486256947646466

I love the idea of a debugging dataset. Get all the format/input-output errors out of the way before you get to your full dataset on CI.
