Step with no dependency keeps running


#1

Hi!

When I have a step in my pipeline that does not have any dependencies, e.g. a wget command, dvc repro will call that command every time, even if the output is already there. Is this intended behavior?

In my case it does not really make sense, as the command should only run if the output (the downloaded file) is not already there. I know about setting up external dependencies (remotes), but they don’t really work for me here…


#2

Hi @peter !

Yes, this is intended behavior. We call it a “callback stage”, since it gets called every time. It is useful when a stage needs to check some condition or do something before any other stage runs. For example, if you want to wget something from a server that doesn’t provide an ETag for the file, you have no way of knowing whether the file has changed other than downloading it and comparing it to the previous version. With a callback stage such as dvc run -o data wget data, the data is downloaded on every run, and if it hasn’t changed since the previous run, it won’t trigger reproduction of the stages further down the pipeline.

In your particular scenario, if you don’t want that wget stage to run every time, there are a few options:

  1. Lock the wget stage so that it doesn’t get reproduced anymore:

    $ dvc run -o data wget https://example.com/data
    $ dvc lock data.dvc
    

    If you want to run it again later, use dvc unlock data.dvc to unlock it, run dvc repro, and then lock it again.

  2. Add a dummy dependency to the stage that calls wget. For example:

    $ echo dummy > dummy
    $ dvc run -d dummy -o data wget https://example.com/data
    

    This way dvc will trigger reproduction of this stage only if the dummy or data files have changed.

  3. If your server does support ETag for the file, then the proper way would be to use the external dependency feature:

    $ dvc import https://example.com/data data
    

    or

    $ dvc run -d https://example.com/data -o data wget https://example.com/data
    

    Unfortunately, this feature is not yet implemented for HTTP(S) remotes. I’ve added https://github.com/iterative/dvc/issues/1146 to track progress on that. Please let us know if you would like to use this feature, and we will raise its priority and implement and release it ASAP.
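
To make the callback stage idea from the top of this post a bit more concrete, here is a rough shell sketch of the "run every time, but only propagate when the output changed" logic. It is not how DVC is implemented: the function names checksum_of and run_callback are made up for illustration, and POSIX cksum stands in for the MD5 that DVC records for outputs.

```shell
#!/bin/sh
# Illustrative sketch only, not DVC's implementation.
# checksum_of and run_callback are hypothetical helper names.

# Print a checksum of a file's contents, or "missing" if the file does not exist.
checksum_of() {
    if [ -f "$1" ]; then
        # Reading from stdin keeps the file name out of the output.
        cksum < "$1"
    else
        echo missing
    fi
}

# Run a command that (re)creates an output, then report whether the
# output's contents differ from what was there before the command ran.
# Usage: run_callback <output> <command> [args...]
# Prints "changed" or "unchanged"; returns 1 if the command itself fails.
run_callback() {
    out=$1
    shift
    before=$(checksum_of "$out")
    "$@" || return 1
    after=$(checksum_of "$out")
    if [ "$before" = "$after" ]; then
        echo unchanged
    else
        echo changed
    fi
}
```

For instance, run_callback data wget -q -O data https://example.com/data would always perform the download, but print changed only when the downloaded contents differ from what was already in data. That corresponds to the case where the stages further down the pipeline would need to be reproduced.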

Thanks,
Ruslan


#3

Thanks for your reply, @kupruser!

I forgot about the lock feature, but that would probably be the ideal solution for us. I also like that you get a warning that a stage is locked when reproducing; otherwise it might lead to confusion for coworkers.

Thanks again,
Peter