The link is broken
@drorata Fixed. Sorry, but this is how the blog engine works now - you see the Discourse message first
CONGRATS for this amazing milestone!!!
I have a few points that might need further clarification:

- Multi-stage DVC files: How does the `params` section relate to the code in `process_raw_data`? Is `process_file` specified in the code?
- Run cache: I am so used to the linkage between `dvc` and `git` that I find this section confusing… Can `dvc` “persist” a state without having “S3, Azure Blob, SSH” around? Maybe a few more examples would be helpful here.
- Plots: Now I’m even more confused. `dvc` doesn’t use `git` anymore to track the state, but the plots compare git hashes?
Thank you @drorata! Happy to clarify…
Multi-stage DVC files - You are supposed to specify the parameters in `dvc run -p process_file,click_threshold ... ./process_raw_data ...` and just read the params from `params.yaml`.
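A minimal sketch of that flow (the stage name, script, and parameter values here are hypothetical):

```sh
# params.yaml holds the values; DVC tracks them and the script reads them.
cat > params.yaml <<'EOF'
process_file: data/raw.csv
click_threshold: 0.5
EOF

# Register the stage and declare which params it depends on.
dvc run -n process \
    -p process_file,click_threshold \
    -d process_raw_data.py \
    -o data/processed.csv \
    python process_raw_data.py

# Inside process_raw_data.py you would read params.yaml yourself, e.g.:
#   import yaml
#   params = yaml.safe_load(open("params.yaml"))
#   threshold = params["click_threshold"]
```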
Run cache - First, you still have the linkage between dvc and Git - just commit `dvc.lock` (which contains all the links) into Git. Run-cache is needed when you want to avoid the commit for some reason.
Second, run-cache has a long memory of runs. If you change code and hyperparams to values that were already used (with or without commits), `dvc repro` will find the result in the run-cache and instantly return it without wasting time on training.
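For example (a rough sketch; the parameter name and the GNU `sed` edits are illustrative):

```sh
# Try a new threshold and retrain.
sed -i 's/click_threshold: 0.5/click_threshold: 0.7/' params.yaml
dvc repro    # the stage runs and the result lands in the run cache

# Change your mind and go back to the old value.
sed -i 's/click_threshold: 0.7/click_threshold: 0.5/' params.yaml
dvc repro    # same deps/params combination: restored from the run cache, no retraining
```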
Plots - It does use Git (if you need it) and it extracts all the diffs properly.
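For instance, to compare plots between Git revisions (a sketch; the revisions are just examples):

```sh
# Render plots for two revisions side by side; DVC resolves each revision's outputs.
dvc plots diff HEAD~1 HEAD
```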
I’ll get into the run cache a bit more:

- `dvc add` still creates a pointer file to some data. You are still supposed to commit it to git, and it still contains a checksum reference to your data. In this sense dvc still versions your data with git.
- Run cache works for `dvc run/repro`. When you run a command you have a combination of deps: data files, code, and params, which produce some result. This result is saved into the run cache and the `dvc.lock` file. If you commit the changes to the lock file to git, then it works the same as before.
- You typically use the run cache by tweaking your code or params and rerunning some stage without intermediate commits. If you happen to return to a combination of code, params, and data you already tried, the stage result will be fetched from the cache instead of rerunning the stage command on `dvc repro`.
- You don’t need S3 or any other remote cache to use the run cache so far. If you do have a remote, then your local run cache will be sent to and received from the remote along with the usual cache on `dvc push/pull` commands. This enables you and your teammates to quickly reuse each other’s results on different machines. It also enables CI or any other cloud/remote/background job to add precalculated runs to the run cache, which can be quickly fetched later, e.g. in your dev environment or on a production system.
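A sketch of sharing both caches through a remote (the remote name and S3 path are placeholders):

```sh
# One-time setup: point the project at a shared remote.
dvc remote add -d storage s3://my-bucket/dvc-store

# Teammate A pushes the data cache (and run cache) after experimenting.
dvc push

# Teammate B pulls everything; repro then resolves already-tried runs instantly.
dvc pull
dvc repro
```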
Cool stuff!
I have a question… How should I update my projects using dvc 0.94 to use dvc 1.0? Should I start from scratch and redefine my pipeline?
@Suor a couple questions on this:
- Does `dvc pull/push` need a special `--` option to download/upload the run-cache?
- Does the run-cache include the actual cached outputs from the pipeline in each experiment? (That could be quite big, since each different output is repeated in the cache even if only one byte changed.)
@fredtcaroli hi!
That’s a great question and I feel like we should probably publish a guide for it… But it’s not so complicated actually. @skshetry outlined it here: Remove/Redefine Stage - #12 by skshetry
It depends a little on what kind of 0.94 project you had, because usually they have one `.dvc` file per stage, but the feature of having a single pipelines file (`pipelines.yaml`) already existed (it was just partially hidden).
The best way is to combine all `.dvc` files into the new `dvc.yaml` manually (see the expected format) and run `dvc repro` at the end to check that the file and stages are valid, and to regenerate the outputs and put them in the cache. You could skip the cache part with `--no-commit` if the outputs are already there.
It may also be possible to skip `dvc repro` entirely by using `dvc commit` instead, but I think this is more advanced and a little difficult, since it may expect you to also create `dvc.lock`, which is not that easy to edit manually.
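To illustrate, a hand-written `dvc.yaml` combining two old single-stage `.dvc` files might look like this (a sketch; the stage names, commands, and paths are hypothetical):

```sh
# Old layout: prepare.dvc and train.dvc, one stage each.
# New layout: one dvc.yaml listing both stages.
cat > dvc.yaml <<'EOF'
stages:
  prepare:
    cmd: python prepare.py data/raw.csv
    deps:
      - prepare.py
      - data/raw.csv
    outs:
      - data/prepared.csv
  train:
    cmd: python train.py
    deps:
      - train.py
      - data/prepared.csv
    outs:
      - model.pkl
EOF

# Validate the stages and rebuild dvc.lock
# (add --no-commit to skip re-caching outputs that are already in the cache).
dvc repro
```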
Can we still create multiple pipelines in the same project?
@sdelo absolutely! There is no requirement for commands to be connected into a single pipeline. You can create as many as needed.
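For example, a single `dvc.yaml` can hold stages that form completely independent pipelines (a sketch; the names are hypothetical):

```sh
cat > dvc.yaml <<'EOF'
stages:
  # Pipeline A: image preprocessing (standalone)
  resize_images:
    cmd: python resize.py
    deps:
      - resize.py
    outs:
      - data/resized

  # Pipeline B: text model (no connection to pipeline A)
  train_text_model:
    cmd: python train_text.py
    deps:
      - train_text.py
    outs:
      - text_model.pkl
EOF

# dvc dag renders them as separate graphs.
dvc dag
```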
@sdelo also please note that currently you can have a `dvc.yaml` file (where stages and pipelines are defined) in each subdirectory. `dvc run` creates or updates it in the current working directory. Thanks
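A quick sketch of that layout (the directory and stage names are made up):

```sh
# Each subproject keeps its own dvc.yaml; dvc run writes to the one in $PWD.
cd image_pipeline/
dvc run -n resize -d resize.py -o out/ python resize.py     # -> image_pipeline/dvc.yaml

cd ../text_pipeline/
dvc run -n train -d train.py -o model.pkl python train.py   # -> text_pipeline/dvc.yaml
```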
Thanks for the feedback. I ask because when defining multiple pipelines, the `dvc.yaml` puts them all under the `stages` tag. Is there a way to see pipeline groupings in the `dvc.yaml`, or must I use `dvc dag` to see the groupings?
That’s great feedback @sdelo, thanks.
No, there’s no way to change the overall structure of `dvc.yaml` at the moment. Feel free to open a feature request in https://github.com/iterative/dvc/issues to support multiple named lists (pipelines) under `stages`!
For now, stages are just added as they come if you use `dvc run`. It’s best to edit `dvc.yaml` manually so you can order them however you like, e.g. grouping each pipeline’s stages together.
Best