Remove/Redefine Stage

I made a mistake by defining a stage while forgetting to define its output with the -o option. I tried removing the existing stage with dvc remove pipelines.yaml:my-stage-name -p, but this does not remove the stage from pipelines.yaml. I eventually manually deleted the stage from pipelines.yaml, and created a new stage with the same name using dvc run .... My questions are:

  1. What’s the recommended way to redefine a stage?

  2. Is manually removing the stage from pipelines.yaml sufficient to completely delete the stage, or will there still be metadata about the stage elsewhere?

Hi @nimrand,

To overwrite a stage you can just use the -f (--force) option of dvc run.

I’m a little confused by some of the examples in your message though:

  • dvc pipelines.yaml:my-stage-name -p: this doesn’t look like a valid command.
  • pipelines.yaml: we don’t support this name for pipelines files.

What version of DVC are you using? Maybe you caught a very alpha pre-release that is now quite changed. Can you please share the output of dvc version if this problem persists? Or just try installing the latest beta pre-release. See https://dvc.org/doc/install/pre-release

Thanks!

On your second question: we encourage you to edit pipelines files (dvc.yaml) by hand, and doing so does remove the stage itself from the pipeline. You may also want to run dvc commit afterwards so that dvc.lock is updated as well. Any old files will remain in the cache, but you can use dvc gc to remove them. Please see the gc command reference.

@jorgeorpinel Sorry, that first command should have read dvc remove pipelines.yaml:my-stage-name -p. I’ve edited my post for clarity.


That’s the file it generates. I have done nothing to change this.

The DVC version is 0.94.0. It was installed to my mac yesterday via pip.

OK, thanks for all the info. Can you also share the dvc run command you used that generates this file? I need to try reproducing your scenario. Thanks

p.s. for dvc remove (see cmd ref) the target expected should just be the .dvc file created by dvc run. But the name should be something like my_stage.dvc for DVC 0.94…

If you’re not very committed to that version of DVC anyway, I’d definitely recommend you upgrade to 1.0 (currently in beta pre-release) though.

I believe the command was dvc run -n train-session-similarity -d data/session_similarity_train.parquet -d train_session_similarity.py python train_session_similarity.py. I forgot to define the output, which is why I needed to redefine it.

I see. Maybe pipelines.yaml was the default name of the stage because you didn’t provide an output. It used to be Dvcfile but maybe the latest 0.94 stable changed it accidentally (that was a first step towards the new dvc.yaml pipelines file we use in 1.0).

Anyway… so yes just use the same dvc run command again but add -f and the -o <output> please!

Oops hold on. Please check dvc run -h to confirm what the overwrite option was in 0.94. It may be --overwrite, not -f.

Sorry, we’re in the midst of this major 1.0 release and I get confused between versions sometimes.

Thank you for the help.

I tried this in a fresh repo. I cannot get dvc run -n stage-name ... to run a second time with the same stage name no matter what options I try. I’ve tried --overwrite and --overwrite-dvc. In both cases, it gives me an error saying that a stage with that name already exists. There is no -f option.

I’m now trying to just edit the pipelines.yaml file and then call dvc repro pipelines.yaml (which may also need to be followed by a dvc commit?). The repro is running now, and so far that seems to be working.

Is this the best version to be using?

In 0.94, having multiple stages in one file was a hidden feature, as it was still a work in progress. We have made a lot of changes and improvements since then, which are now available in v1.0.

pipelines.yaml is now renamed to dvc.yaml and has some file format changes. To redefine a stage, you can use --force and it should overwrite. In 1.0, you can remove a stage from the pipeline file using dvc remove <stage_name>.

Now, for you to migrate, the easiest thing would be to re-create all stages with dvc run (which should now write them to dvc.yaml), build from there, and later remove pipelines.yaml.

If you don’t want to use dvc run, you can edit the file manually and then rename pipelines.yaml to dvc.yaml and pipelines.lock to dvc.lock. The incompatibility between the 0.94 and 1.0 .yaml formats is that dvc.yaml no longer has sections such as metrics_no_cache, outs_persist, and outs_no_cache. So, if you do that, you might need to make adjustments similar to the following:

In 0.94:

stages:
  stage_one:
    cmd: "command"
    outs:
     - foo
    outs_no_cache:
     - bar
    outs_persist:
     - foobar

In 1.0:

stages:
  stage_one:
    cmd: "command"
    ...
    outs:
     - foo
     - bar:
         cache: false
     - foobar:
         persist: true
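
For anyone with many stages to convert, the adjustment above can be sketched as a small Python helper. This is only an illustration under stated assumptions, not an official migration tool: it works on an already-parsed pipelines file (a plain dict, however you load the YAML), the function names migrate_stage and migrate_pipeline are made up for this example, and the metrics_no_cache mapping is assumed to follow the same pattern as outs_no_cache.

```python
from copy import deepcopy

# 0.94 section name -> (1.0 section, flag name, flag value).
# Only the three sections mentioned in this thread are handled.
# The metrics_no_cache entry is an assumption by analogy with outs_no_cache.
LEGACY_SECTIONS = {
    "outs_no_cache": ("outs", "cache", False),
    "outs_persist": ("outs", "persist", True),
    "metrics_no_cache": ("metrics", "cache", False),
}


def migrate_stage(stage):
    """Return a copy of a 0.94-style stage dict rewritten in the 1.0 layout."""
    migrated = deepcopy(stage)
    for legacy, (target, flag, value) in LEGACY_SECTIONS.items():
        # Move each path from the legacy section into the target section,
        # attaching the flag that the legacy section name used to imply.
        for path in migrated.pop(legacy, []):
            migrated.setdefault(target, []).append({path: {flag: value}})
    return migrated


def migrate_pipeline(pipeline):
    """Apply migrate_stage to every stage under the top-level 'stages' key."""
    stages = pipeline.get("stages", {})
    return {
        **pipeline,
        "stages": {name: migrate_stage(stage) for name, stage in stages.items()},
    }


# The 0.94 example from this post, as a parsed dict.
old = {
    "stages": {
        "stage_one": {
            "cmd": "command",
            "outs": ["foo"],
            "outs_no_cache": ["bar"],
            "outs_persist": ["foobar"],
        }
    }
}
new = migrate_pipeline(old)
print(new["stages"]["stage_one"]["outs"])
```

The input dict is left untouched, so you can compare before and after, and check the result against your own files before renaming anything to dvc.yaml.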

Np @nimrand thanks for keeping on this.

I cannot get dvc run -n stage-name ... to run a second time

This is confusing again because -n is a 1.0 feature. Are you sure you’re on DVC 0.94? :)

And yes as Saugat pointed out, the option should be --force — sorry for the confusion (you can double check all the options for your version with dvc run -h).
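
To make the redefinition concrete, here is a rough Python sketch that assembles the dvc run invocation for DVC 1.0 using only the options mentioned in this thread (-n, -d, -o, --force). The helper name build_redefine_command is hypothetical, and "<output>" is left as a placeholder because the real output path was never given here; substitute your own.

```python
import shlex


def build_redefine_command(name, deps, outs, cmd):
    """Build the argv for re-running `dvc run` over an existing stage name."""
    # --force asks dvc run to overwrite the existing stage definition.
    argv = ["dvc", "run", "-n", name, "--force"]
    for dep in deps:
        argv += ["-d", dep]
    for out in outs:
        argv += ["-o", out]
    # The stage's own command goes last, after all dvc options.
    return argv + list(cmd)


argv = build_redefine_command(
    name="train-session-similarity",
    deps=["data/session_similarity_train.parquet", "train_session_similarity.py"],
    outs=["<output>"],  # placeholder: replace with the real output path
    cmd=["python", "train_session_similarity.py"],
)
print(shlex.join(argv))
```

The resulting list can be handed to subprocess.run(argv, check=True) once the placeholder is replaced. As noted above, confirm the option names for your installed version with dvc run -h first.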

I’m now trying to just edit the pipelines.yaml file

This is a good strategy. These files are supposed to be easy to edit manually. We want to encourage our users to do this. dvc run is now considered a helper more than the main way to write stages.

may also need to be followed by a dvc commit

Yes, if the deps and outs you manually put in the pipelines file already exist in the workspace and you want DVC to cache them. After that, dvc status will report that everything is up to date. The other option is indeed to use dvc repro, which will actually run the stages and regenerate all the outputs (and cache them). That may be the better choice, since it also checks that the pipeline works.

Is this the best version to be using?

We actually released 1.0 today officially! Please check it out and definitely consider migrating at this point. We will not be supporting 0.94 much longer (meaning we will stop releasing patches for it).