The link is broken
@drorata Fixed. Sorry, but this is how the blog engine works now - you see the Discourse message first.
CONGRATS for this amazing milestone!!!
I have a few points that might need further clarification:
- Multi-stage DVC files: how does the `params` section relate to the code specified in `process_file`?
- Run cache: I am so used to the linkage between `dvc` and `git` that I find this section confusing… Can `dvc` “persist” a state without having “S3, Azure Blob, SSH” around? Maybe a few more examples could be helpful here.
- Plots: now I’m even more confused. `dvc` doesn’t use `git` anymore to track the state, but the plots compare Git hashes?
Thank you @drorata! Happy to clarify…
Multi-stage DVC files - you are supposed to specify the parameters in `dvc run -p process_file,click_threshold ... ./process_raw_data ...` and just read the params from the `params.yaml` file in the code.
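For illustration, that command would record a stage roughly like the following in `dvc.yaml` (the stage name and file paths below are made up; only `process_file` and `click_threshold` come from the command above). The parameter *values* live in the default `params.yaml` file, which the script reads itself:

```yaml
stages:
  process:                 # hypothetical stage name
    cmd: ./process_raw_data data/raw data/processed
    deps:
      - process_raw_data
      - data/raw
    params:                # names from the -p flag; values are in params.yaml
      - process_file
      - click_threshold
    outs:
      - data/processed
```

Changing `process_file` or `click_threshold` in `params.yaml` then marks this stage as changed, so the next `dvc repro` reruns it.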
Run cache - first, you still have the linkage between `dvc` and Git: just commit `dvc.lock` into Git; it contains all the links. Run-cache is needed when you want to avoid the commit for some reason.
Second, run-cache has a long memory of runs. If you change code and hyperparams back to values that were already used (with or without commits), `dvc repro` will find the result in the run-cache and instantly return it, without wasting time on training.
Plots - it does use Git (if you need it) and extracts all the diffs properly.
I’ll get into the run cache more:

`dvc add` still creates a pointer file to some data. You are still supposed to commit it to Git, and it still contains a checksum reference to your data. In this sense, `dvc` still versions your data with Git.
Run cache works for `dvc run`/`dvc repro`. When you run a command, you have a combination of deps - data files, code, and params - which produces some result. This result is saved into the run cache and the `dvc.lock` file. If you commit the changes to the lock file to Git, then it works the same as before.
- You typically use the run cache by tweaking your code or params and rerunning some stage without intermediate commits. If you happen to return to a combination of code, params, and data you already tried, the stage result will be fetched from the cache instead of rerunning the stage command on `dvc repro`.
- You don’t need to have S3 or any other remote cache to use the run cache so far. If you do have a remote, then your local run cache will be sent to and received from the remote, along with the usual cache, on `dvc push`/`dvc pull` commands. This enables you and your teammates to quickly reuse each other’s results on different machines. It also enables CI or any other cloud/remote/background job to add precalculated runs to the run cache, which can be quickly fetched later, e.g. in your dev environment or on a production system.
I have a question… How should I update my projects from DVC 0.94 to DVC 1.0? Should I start from scratch and redefine my pipeline?
@Suor a couple of questions on this:

- Do `dvc pull`/`dvc push` need a special `--` option to download/upload the run-cache?
- Does the run-cache include the actual cached outputs from the pipeline in each experiment? (That could get quite big, since each different output is repeated in the cache even if only one byte changed.)
It depends a little on what kind of 0.94 project you had, because usually they have one `.dvc` file per stage, but the feature to have a single pipelines file (`pipelines.yaml`) already existed (it was just partially hidden).

The best way is to combine all `.dvc` files into the new `dvc.yaml` manually (see the expected format) and run `dvc repro` at the end to check that the file and stages are valid, and to regenerate the outputs and put them in the cache. You could skip the cache part if they’re already there.
It may also be possible to skip `dvc repro` entirely by using `dvc commit` instead, but I think this is more advanced and a little difficult, since it may expect you to also create `dvc.lock`, which is not that easy to edit manually.
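As a rough sketch of that manual step (the stage name and paths below are hypothetical): each old single-stage `.dvc` file becomes one entry under `stages:` in `dvc.yaml`, keeping its `cmd`, `deps`, and `outs`, while the per-file checksums are not copied over - `dvc repro` regenerates those in `dvc.lock`:

```yaml
# dvc.yaml - one entry per old .dvc file
stages:
  featurize:                 # was featurize.dvc in 0.94
    cmd: python featurize.py
    deps:
      - data/raw.csv         # md5 checksums from the old file stay out of here;
      - featurize.py         # they end up in dvc.lock after dvc repro
    outs:
      - data/features.csv
```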
Can we still create multiple pipelines in the same project?
@sdelo absolutely! There is no requirement for commands to be connected into a single pipeline. You can create as many as needed.
@sdelo also please note that currently you can have a `dvc.yaml` file (where stages and pipelines are defined) in each subdirectory. `dvc run` creates or updates it in the current working directory. Thanks
Thanks for the feedback. I ask because when defining multiple pipelines, `dvc.yaml` puts them all under the `stages` tag. Is there a way to see pipeline groupings in `dvc.yaml`, or must I use `dvc dag` to see the groupings?
That’s great feedback @sdelo, thanks.
No, there’s no way to change the overall structure of `dvc.yaml` at the moment. Feel free to open a feature request in https://github.com/iterative/dvc/issues to support multiple named lists (pipelines) under `stages`.

For now, stages are just added as they come if you use `dvc run`. It’s best to edit `dvc.yaml` manually so you can easily order them chronologically.
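For example (stage and file names below are hypothetical), manual grouping currently comes down to ordering and comments, since every stage lives under the single `stages:` key:

```yaml
stages:
  # --- pipeline A ---
  prepare:
    cmd: python prepare.py
    outs:
      - data/prepared
  train:
    cmd: python train.py
    deps:
      - data/prepared
  # --- pipeline B (independent of A) ---
  report:
    cmd: python report.py
    outs:
      - report.html
```

`dvc dag` still infers the actual pipelines from the dependency graph, regardless of how the entries are ordered in the file.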