DVC 1.0 release

The link is broken :frowning:

@drorata fixed. Sorry, but this is how the blog engine works now - you see the discord message first :slight_smile:

CONGRATS for this amazing milestone!!! :heart::heart::heart:

I have a few points that might need further clarification:

  • Multi-stage DVC files: How does the params section relate to the code in process_raw_data? Is process_file specified in the code?
  • Run cache: I am so used to the linkage between dvc and git that I find this section confusing… Can dvc “persist” a state without having “S3, Azure Blob, SSH” around? Maybe a few more examples could be helpful here.
  • Plots: Now I’m even more confused. dvc doesn’t use git anymore to track the state, but the plots compare git hashes?

Thank you @drorata! Happy to clarify…

Multi-stage DVC files - You are supposed to specify the parameters in dvc run -p process_file,click_threshold ... ./process_raw_data ... and then just read the params from params.yaml in your code.
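
To illustrate the "read the params from params.yaml" part, here is a minimal sketch of what the script side could look like. It is not taken from the release post: the file content and values are invented, and to stay dependency-free it only parses flat `key: value` lines instead of using a real YAML library (which an actual script would probably do).

```python
from pathlib import Path


def parse_flat_params(text):
    """Parse top-level `key: value` lines.

    A simplified stand-in for a YAML loader: no nesting, no lists,
    and every value is returned as a string.
    """
    params = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        params[key.strip()] = value.strip()
    return params


def load_params(path="params.yaml"):
    """Read the params file that sits next to the pipeline definition."""
    return parse_flat_params(Path(path).read_text())


# Hypothetical params.yaml content; the values are made up for illustration.
example = """\
process_file: raw_data.log
click_threshold: 15   # tune this between runs
"""

params = parse_flat_params(example)
click_threshold = int(params["click_threshold"])  # convert explicitly
```

The names passed to -p only tell dvc which keys the stage depends on; the script still has to load the values itself, as above.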

Run cache - First, you still have the linkage between dvc and Git: just commit dvc.lock, which contains all the links, into Git. Run-cache is needed when you want to avoid the commit for some reason.

Second, run-cache has a long memory of runs. If you change code and hyperparameters to values that were already used (with or without commits), dvc repro will find the result in the run-cache and return it instantly without wasting time on training.

Plots - It does use Git (if you need it) and it extracts all the diffs properly.

I’ll get into Run cache more:

  1. dvc add still creates a pointer file to some data. You are still supposed to commit it to git, and it still contains a checksum reference to your data. In this sense dvc still versions your data with git.
  2. Run cache works for dvc run/repro. When you run a command you have a combination of deps (data files, code and params) which produces some result. This result is saved into the run cache and the dvc.lock file. If you commit the changes to the lock file to git, then it works the same as before.
  3. You typically use run cache by tweaking your code or params and rerunning some stage without intermediate commits. If you happen to return to a combination of code, params and data you already tried, the stage result will be fetched from cache on dvc repro instead of rerunning the stage command.
  4. You don’t need to have S3 or any other remote cache to use run cache. If you do have a remote, then your local run cache will be sent to and received from it along with the usual cache on dvc push/pull commands. This enables you and your teammates to quickly reuse each other’s results on different machines. This also enables CI or any other cloud/remote/background job to add precalculated runs to run cache, which may be quickly fetched later, e.g. in your dev environment or on a production system.
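
The idea in points 2 and 3 can be sketched in a few lines. This is only an illustration of the concept, not how dvc implements it: the class, the function names and the use of SHA-256 are all assumptions made for the example.

```python
import hashlib
import json


def run_key(cmd, dep_hashes, params):
    """Build a deterministic key from everything that determines a result."""
    payload = json.dumps(
        {"cmd": cmd, "deps": dep_hashes, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class RunCache:
    """Toy in-memory run cache keyed by command + dependency hashes + params."""

    def __init__(self):
        self._results = {}
        self.executions = 0  # how many times a command was really executed

    def repro(self, cmd, dep_hashes, params, execute):
        key = run_key(cmd, dep_hashes, params)
        if key not in self._results:
            # Unseen combination: run the stage and remember its result.
            self.executions += 1
            self._results[key] = execute(params)
        return self._results[key]


def train(params):
    """Stand-in for an expensive training command."""
    return "model(lr={})".format(params["lr"])


cache = RunCache()
deps = {"train.py": "abc123", "data.csv": "def456"}  # made-up checksums

cache.repro("python train.py", deps, {"lr": 0.1}, train)  # executed
cache.repro("python train.py", deps, {"lr": 0.2}, train)  # executed
restored = cache.repro("python train.py", deps, {"lr": 0.1}, train)  # reused
```

Going back to lr=0.1 does not call train again, so cache.executions stays at 2. No commit and no remote are involved at any point, which is the case described in point 4.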

Cool stuff! :slight_smile:

I have a question… How should I update my projects using dvc 0.94 to use dvc 1.0? Should I start from scratch and redefine my pipeline?

@Suor a couple questions on this:

  • Does dvc pull/push need a special – option to download/upload the run-cache?
  • Does the run-cache include the actual cached outputs from the pipeline in each experiment? (That could be quite big since each different output is repeated in cache even if only one byte changed.)

@fredtcaroli hi!

That’s a great question and I feel like we should probably publish a guide for it… But it’s not so complicated actually. @skshetry outlined it here: Remove/Redefine Stage

It depends a little bit on what kind of 0.94 project you had: usually they have one .dvc file per stage, but the feature to have a single pipelines file (pipelines.yaml) already existed (it was just partially hidden).

The best way is to combine all .dvc files into the new dvc.yaml manually (see expected format) and run dvc repro at the end to check that the file and stages are valid, to regenerate the outputs, and to put them in cache. If the outputs are already in cache, you could skip the caching part with --no-commit.
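
As a rough sketch of that manual merge, the mapping could look like the code below. It assumes the old stage files are already parsed into dicts with `cmd`, `wdir`, `deps` and `outs` fields (each dep/out carrying a `path`), and it keeps only the paths, so per-output flags such as cache or metric settings would still need to be carried over by hand. The helper names are made up, and the result should be checked against the expected dvc.yaml format before relying on it.

```python
def to_stage(old):
    """Convert one parsed 0.x stage file (a dict) into a dvc.yaml stage entry."""
    stage = {"cmd": old["cmd"]}
    if old.get("wdir", ".") != ".":
        stage["wdir"] = old["wdir"]
    deps = [dep["path"] for dep in old.get("deps", [])]
    outs = [out["path"] for out in old.get("outs", [])]
    if deps:
        stage["deps"] = deps
    if outs:
        stage["outs"] = outs
    return stage


def combine(old_files):
    """Merge stage files into one {'stages': {...}} mapping.

    old_files maps a file name such as 'train.dvc' to its parsed content.
    Files without a command (created by dvc add) only track data, define
    no stage, and are skipped.
    """
    stages = {}
    for filename, old in old_files.items():
        if "cmd" not in old:
            continue
        if filename.endswith(".dvc"):
            name = filename[: -len(".dvc")]  # 'train.dvc' -> 'train'
        else:
            name = filename
        stages[name] = to_stage(old)
    return {"stages": stages}
```

The returned mapping would then be written out as dvc.yaml, followed by dvc repro to validate it as described above.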

It may also be possible to skip dvc repro entirely by using dvc commit instead, but I think this is more advanced and a little difficult, since it may expect you to also create dvc.lock, which is not that easy to edit manually.

Can we still create multiple pipelines in the same project?

@sdelo absolutely! There are no requirements for commands to be connected into a single pipeline. You can create as many as needed.

@sdelo also please note that currently you can have a dvc.yaml file (where stages and pipelines are defined) in each subdirectory. dvc run creates or updates it in the current working directory. Thanks

Thanks for the feedback. I ask because when defining multiple pipelines, dvc.yaml puts them all under the stages key. Is there a way to see pipeline groupings in dvc.yaml, or must I use dvc dag to see the groupings?

That’s great feedback @sdelo, thanks.

No, there’s no way to change the overall structure of dvc.yaml at the moment. Feel free to open a feature request in https://github.com/iterative/dvc/issues to support multiple named lists (pipelines) under stages!

For now stages are just added as they come if you use dvc run. It’s best to edit dvc.yaml manually so you can easily order them chronologically.
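
In the meantime, the groupings can be recovered from the stages themselves, since two stages belong to the same pipeline when one consumes what the other produces. Below is a small sketch of that idea; it is not how dvc computes its graph. It assumes the stages mapping is already loaded from dvc.yaml, that deps and outs are plain path strings, and that paths match exactly (files inside an output directory are not handled).

```python
def pipeline_groups(stages):
    """Group stage names into pipelines (connected components)."""
    # Which stage produces each output path.
    producer = {}
    for name, stage in stages.items():
        for out in stage.get("outs", []):
            producer[out] = name

    # Undirected links between a stage and the stages it depends on.
    neighbours = {name: set() for name in stages}
    for name, stage in stages.items():
        for dep in stage.get("deps", []):
            upstream = producer.get(dep)
            if upstream is not None and upstream != name:
                neighbours[name].add(upstream)
                neighbours[upstream].add(name)

    # Depth-first walk to collect each connected group once.
    groups, seen = [], set()
    for name in stages:
        if name in seen:
            continue
        seen.add(name)
        stack, group = [name], []
        while stack:
            current = stack.pop()
            group.append(current)
            for other in neighbours[current]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        groups.append(sorted(group))
    return groups


# Hypothetical stages: two independent pipelines in one dvc.yaml.
stages = {
    "prepare": {"deps": ["raw.csv"], "outs": ["clean.csv"]},
    "train": {"deps": ["clean.csv"], "outs": ["model.pkl"]},
    "scrape": {"outs": ["pages"]},
    "index": {"deps": ["pages"], "outs": ["index.db"]},
}
groups = pipeline_groups(stages)
```

Here groups contains one list per pipeline, in the order the first stage of each pipeline appears in the file.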

Best
