`dvc pull --run-cache [target]`

Hi!

Context

Here is my use case:

  1. I train a model automatically in whatever way (CML, Airflow)
  2. Now I have a weights.db file, which is versioned by DVC alongside other intermediate data files
  3. I do a dvc push --run-cache, as I do not want to version through dvc.lock since I am not committing anything
    -> Now there are several files in my S3.
    -> Time to run my model in prod
  4. In a Docker container, run dvc pull --run-cache generate_weights (the stage that created weights.db)
  5. DVC starts to pull all files from the remote: not only previous weights.db versions, but also caches from other stages.

This makes my approach unfeasible as pulling all files from remote takes a long time.
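For concreteness, here is roughly what the two sides look like in my setup (a sketch; the stage and file names are the ones from the steps above, and the training commands may differ slightly):

```sh
# Training side (the automated retraining job)
dvc repro generate_weights    # produces weights.db
dvc push --run-cache          # push the cache + run-cache entries, no commit involved

# Production side (Docker container)
dvc pull --run-cache generate_weights
# intent: fetch only what generate_weights produced (weights.db),
# but in practice this downloads caches from other stages as well
```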

Question(s)

  1. Should dvc pull --run-cache [target] pull all files?
  2. Should I be doing it the way I am doing? I am avoiding CML/Git-related solutions.

Thank you!

Right now --run-cache works this way, unfortunately. DVC cannot reconstruct the entire DAG without having all the files in the workspace, so it cannot pull only part of the DAG (a single file or a stage).

Granular pulling from the run cache is an upcoming feature. I'll try to prioritize it. Btw… you can go ahead and create a feature request / bug report with a link to this discussion at https://github.com/iterative/dvc/issues/. User-created issues usually get higher priority :slight_smile:

Hi @victor.maricato, thanks for the great question!

Could you elaborate on this please?

Also, could you please share the reason for not using Git / committing before releasing your model to production? To my mind that's where DVC gives a lot of value when we move to production - by capturing the dvc.lock. It ensures that you get exactly the right artifact (hash sum), it gives you the ability to go back, etc.

I would really love to learn more about your workflow! Thanks.

The core reason we decided not to do it through Git is that we are experimenting based on what we already do.

Elaborating a little more:

Nowadays we retrain a model through an Airflow DAG, using a Kubernetes Pod to run a Docker container that knows how to retrain the model and save it to the remote, which the production application checks when self-updating its binaries.

To use CML, we went through the cml.dev Cloud Computing section, and we realized that running CML in Kubernetes would be much harder than we were expecting.

Then, the decision was to try keeping it inside our current stack and bend DVC (only) to our use case.

@victor.maricato thanks! :pray:

Then, the decision was to try keeping it inside our current stack and bend DVC (only) to our use case.

But what value does DVC give you in this scenario, if the DAG is described by Airflow and you don't commit anything to Git? Could you describe how exactly you integrate DVC into this?

To use CML, we went through the cml.dev Cloud Computing section, and we realized that running CML in Kubernetes would be much harder than we were expecting.

Yes, really good point. It's on our roadmap to connect it with k8s. Do you have a preference in terms of how it should look?

Hi @shcheklein

But what value does DVC give you in this scenario, if the DAG is described by Airflow and you don't commit anything to Git? Could you describe how exactly you integrate DVC into this?

Actually, I am committing to git.

Airflow launches a KubernetesPodOperator, which then runs dvc repro. In this case, the execution date is a parameter of my DVC pipeline/DAG.

After repro runs, I use bash to execute git commands and open an MR with the updated dvc.lock in a branch tagged with the execution date.
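Roughly, the script the pod ends up running looks like this (a sketch rather than the exact script; it assumes GitLab-style push options for opening the MR, and the branch name and EXECUTION_DATE variable are just illustrative):

```sh
# Inside the KubernetesPodOperator container; EXECUTION_DATE is assumed to be
# injected by Airflow as an environment variable (illustrative name).
dvc repro                                     # retrain; updates dvc.lock
dvc push                                      # upload new outputs to the remote

git checkout -b "retrain/${EXECUTION_DATE}"
git add dvc.lock
git commit -m "Retrain model for ${EXECUTION_DATE}"
# GitLab push options can open the MR directly from the push
git push origin "retrain/${EXECUTION_DATE}" \
    -o merge_request.create -o merge_request.target=master
```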

What I am doing is essentially CML, but with Kubernetes and without CML :slightly_frowning_face:

Yes, really good point. It's on our roadmap to connect it with k8s. Do you have a preference in terms of how it should look?

I am not familiar enough with DVC and K8s to tell you how I visualize it. My opinion probably would not reflect most situations; I don't even think I have a concrete opinion about it yet. If I think of something interesting, I will let you know.

Side question

Is there a more elegant way of scheduling the dvc repro through CML?

I am currently stuck at this task and any suggestions are welcome.

Okay, makes sense. What happens after the MR is created? Does your CI pick it up immediately and create the new Docker image for production? Or does that happen after you merge it?

Also, if you do commit the dvc.lock file you probably don't need the run cache after all … you can do dvc pull model.pkl or something.
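Something along these lines on the production side, that is (a sketch; the model file name is just an example):

```sh
# Production side, when dvc.lock is committed:
git pull              # get the commit whose dvc.lock pins the exact model version
dvc pull model.pkl    # fetch only that output from the remote
```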

Is there a more elegant way of scheduling the dvc repro through CML?

My take on this is that CML implies a different workflow compared to Airflow: MRs or commits trigger some execution; it doesn't happen on a schedule like with Airflow. I think the setup you described is very reasonable, and you need CML in it to validate the MR and generate a report with metrics that summarizes it.

What happens after the MR is created? Does your CI pick it up immediately and create the new Docker image for production? Or does that happen after you merge it?

It happens after the merge request is accepted. metrics.yaml is reviewed by a developer, who decides whether to approve it or not.

Also, if you do commit the dvc.lock file you probably don't need the run cache after all

The reason I wanted to do it with --run-cache was to not use git at all; I am only using git and this MR policy as a workaround. With --run-cache, the production model would pull from it automatically, and Airflow would train and push dvc.lock with the updated weights.

My take on this is that CML implies a different workflow compared to Airflow: MRs or commits trigger some execution; it doesn't happen on a schedule like with Airflow. I think the setup you described is very reasonable, and you need CML in it to validate the MR and generate a report with metrics that summarizes it.

Yes, this is probably the way I am going. But ATM the CML report is done “manually” :eyes:

Okay, so, let's imagine we solved the problem with run-cache granularity. You don't need Git then, and you don't need MRs. What would be the value of DVC in that scenario? Would it make sense to just upload the model file to some predefined location?

You don't need Git then, and you don't need MRs.

You are right, I would not need any of this.

What would be the value of DVC in that scenario? Would it make sense to just upload the model file to some predefined location?

DVC would keep track of changes, params and metrics in each run.

If I just retrained the model and copied it to a directory shared with the production application, I would lose all the code + params <-> model versioning and caching that DVC provides.

For example

Let's say I want to revert my last run because, say, the model was not behaving as expected.

In the current scenario, this would be detected in the MR (the CML-like way).

In the scenario with --run-cache, this would make retrain -> deploy faster, though less stable.

If at any moment a revert is needed, --run-cache would be there to help.

I see that you want to use the run cache as a key-value store for code + params <-> model that you can push to / pull from as you run your DAG regularly. It all makes sense.

What I still don't quite understand is how you move the code + params combination to the production env. You need them for DVC to hit the proper entry in the run cache, right?

With git the mechanism is clear - you commit them and do git pull first.

In the current scenario, this would be detected in the MR (the CML-like way).

Unless I'm missing something, if you do commits and need to revert, you don't need to retrain anything - you can just do dvc get --rev HEAD^1 github.com/repo/url model.pkl to get the previous model.

Yes, you are right.

I did not exactly understand the question - what exactly do you mean by moving code + params to production?

I can answer that code is shared through a Docker image, which is built in CI stages.
Params are also moved through the Docker image, but only after the MR is accepted (the image is only built on the master branch).

Does that answer your question?

Sure. Let me try to clarify. The run cache, as discussed, is kind of a key-value store. The key is determined by a combination of params + code + data dependencies.

Let’s imagine we triggered a new run with some new set of params1 + code1 + data1 that produced model1. We pushed this run cache w/o doing a commit.

Now, to get model1 on the production machine, we need to know the combination params1 + code1 + data1. When we do commits everything is clear to me - we have everything in the commit. But if we don't do a commit, if we don't use git, if we don't use MRs - how will the production machine know about that combination of params1 + code1 + data1 to fetch the model1 file?

In the run cache there is no notion of the "last" run; there is no order. At least at the moment.

I hope it clarifies the concern. Please let me know if you want to explain it further or from a different angle.

In this case, the run cache would track params and data. These are the ones that change when dvc runs in the Airflow retraining pipeline.

Therefore

Day 0: Production and Airflow have data1 + params1 + code1
Day 1: Airflow retrains the model and saves the combination data2 + params2 + code1 to the run cache
Then, the production model checks the run cache for updates periodically and updates itself, yielding a production environment with data2 + params2 + code1.

This is how I am imagining it would work with run-cache and no commits
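For illustration, this is roughly what I imagine each side running (a sketch that assumes the granular pull discussed above existed, and that production can derive the matching params - e.g. the date - on its own; the stage and file names are from my earlier example):

```sh
# Airflow retraining pod (Day 1)
dvc repro generate_weights     # retrain with data2 + params2 + code1
dvc push --run-cache           # publish the run-cache entry, no git commit

# Production container (periodic update check)
dvc pull --run-cache generate_weights   # granular pull of that stage's entries
dvc repro generate_weights              # with matching deps, this should restore
                                        # weights.db from the run cache instead
                                        # of retraining
```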

Got it now!

Yes, it makes sense, but unfortunately it's not how the run cache is organized at the moment (to the best of my knowledge). It indeed tracks data2, params2, code1 and the corresponding model2. The problem is on the consumer side - there is no way to get the most "recent" update. It's literally a key-value store, a black box that can give you a result if you have a key, but can't give you an ordered list of keys (maybe there is an easy way to hack this by traversing it manually).

So, besides granularity, other things would need to be improved/changed/fixed for this to work.

@victor.maricato btw, one question - what about changes to the code? If we don't do commits, where should we save them? I would imagine we would need them if we want to know exactly what combination of code + params + data produced the model.

:eyes: But assuming I have the date (a unique param in my case), the code (assuming it is constant) and the data (versioned/parameterised by date), I have the key-value pair I am looking for.

However, this can indeed be a very unstable environment, and some try/catch policies would be required to recursively get the latest available model.

I think the key game changer here at the moment, in order for this to work in my environment, is the granular dvc pull --run-cache [target].

I don’t see any other blocker at least at this very moment.

We do commit, to the git repo. The git repo builds the Docker image that is shared between the production and retraining environments. Therefore, if any code changes, they will be propagated through the container to both environments.

This way I decouple changes in code (such as optimisation) from the retraining pipeline (which should never change code, only params, specifically the date), but I guarantee that both will always be running the same source code.