Right way to provide optional parameters to script in experiments

Hi! What is the intended way to deal with optional params for scripts in dvc.yaml?
Let’s suppose we have a script that can be run either as python train.py or as python train.py --resume path-to-model-weights.

I can come up with something like this:

# dvc.yaml
stages:
  train:
    deps:
      - train.py
    cmd: python train.py ${resume}

# params.yaml
resume: ""

Then, when I want to run an experiment that resumes training, I use dvc exp run -S resume="--resume path-to-model-weights"

But maybe I’ve missed a more elegant solution? Something that would allow dvc exp run -S resume=path-to-model-weights.

@agushin
I don’t think we have had this use case before, and your approach seems valid. @skshetry might have more information than me on that matter.
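
A possible variation, assuming train.py itself can be changed: always pass the flag from dvc.yaml (for example cmd: python train.py --resume "${resume}") and let the script treat an empty string as "not set". The sketch below is only an illustration of that idea, not an official DVC recommendation; parse_args and the empty-string convention are hypothetical.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical train.py entry point")
    # Default is an empty string, so `--resume ""` and omitting the flag behave the same.
    parser.add_argument(
        "--resume",
        default="",
        help="path to model weights; an empty value means start from scratch",
    )
    args = parser.parse_args(argv)
    # Normalize the empty string to None so the rest of the script sees one "not set" value.
    args.resume = args.resume or None
    return args
```

With resume: "" left in params.yaml, dvc exp run -S resume=path-to-model-weights should then be enough to resume, since the flag name lives in dvc.yaml and only the value is overridden.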

Thank you for the answer! I hope @skshetry could provide more information about this case.

Also, I’m not sure what the right way is to handle situations where a script should be called like this (note the arbitrary number of values supplied to --numbers):

python calculator.py --numbers 1 2 3 4 --operation sum
# or
python calculator.py --numbers 1 2 --operation multiply
# or
python calculator.py --operation sum --numbers 1 2 3 4 5 6

If I had a constant number of arguments, I would do something like this, which is already not very beautiful:

# params.yaml
numbers: [1, 2, 3, 4]
operation: sum

# dvc.yaml
stages:
  calculate:
    cmd: python calculator.py --numbers ${numbers[0]} ${numbers[1]} ${numbers[2]} ${numbers[3]} --operation ${operation}

And if for some reason either

  1. I want to have an arbitrary number of values in this list
  2. I want to have an option to skip this argument

then I don’t have a good idea of how to handle this without modifying my Python script, except maybe by treating this parameter as a string again:

# params.yaml
numbers: "--numbers 1 2 3 4"
operation: sum

# dvc.yaml
stages:
  calculate:
    cmd: python calculator.py ${numbers} --operation ${operation}

It would be great to know a more beautiful solution, if one exists!
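
One way to keep calculator.py untouched might be a thin wrapper script that expands the parameters into arguments itself. This is a minimal sketch under that assumption; to_argv and the wrapper idea are hypothetical and not part of DVC.

```python
def to_argv(params):
    """Turn a dict of parameters into a flat list of command-line arguments.

    Unset values (None, empty string, empty list) are skipped entirely,
    and lists are expanded into one flag followed by several values.
    """
    argv = []
    for name, value in params.items():
        if value is None or value == "" or value == []:
            continue  # optional argument that was not set: emit nothing
        argv.append(f"--{name}")
        if isinstance(value, (list, tuple)):
            argv.extend(str(item) for item in value)
        else:
            argv.append(str(value))
    return argv
```

For instance, to_argv({"numbers": [1, 2, 3, 4], "operation": "sum"}) yields the arguments for the first call above, and the wrapper could hand them to subprocess.run([sys.executable, "calculator.py", *argv]).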

This seems to me like a case where it would be better to support reading values directly from params.yaml in your Python script, instead of always passing them via the command line.
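
As a rough sketch of what that could look like for the calculator example: the script loads params.yaml itself, so the list can have any length or be left out. The function names are hypothetical, and PyYAML is assumed to be installed.

```python
import operator
from functools import reduce

# Map the `operation` value from params.yaml to a binary function.
OPERATIONS = {"sum": operator.add, "multiply": operator.mul}


def load_params(path="params.yaml"):
    import yaml  # PyYAML, a third-party package assumed to be installed

    with open(path) as f:
        return yaml.safe_load(f) or {}


def calculate(params):
    """Apply the configured operation to however many numbers were given."""
    numbers = params.get("numbers") or []
    if not numbers:
        return None  # the optional `numbers` entry was skipped or empty
    return reduce(OPERATIONS[params["operation"]], numbers)
```

The stage command then shrinks to python calculator.py, and listing numbers and operation under the stage's params in dvc.yaml should let DVC notice when they change.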

I’d also be interested in how to handle command-line program arguments in DVC. The example provided by @aguschin is a good one. For some use cases, it just doesn’t make a lot of sense to have a parameter defined in YAML, but instead pass it as a program argument - particularly for mandatory “runtime” parameters, that are not a configuration option in that sense.

Another example would be a pipeline, that processes and analyzes satellite image data for certain points in time. One of the early pipeline steps would be to download that data from some online repository, so I’d like to have a param like --date, that I can just pass without first having to edit params.yaml, perhaps even because I’m running the pipeline headlessly on some server.

Does DVC support such use cases?

Edit: Moreover, from my understanding, the params file is meant to be checked in to version control, which is another hint that this is the wrong place for “runtime” arguments like the date one from above, which have no default value or so.

Hi @ferdi , even though there might be some limitations depending on your use case, having parameters tracked by DVC should not be incompatible with the use cases you described.

Let’s say you have:

# params.yaml
date: 16-02-2022

You can use Templating and DVC will interpolate those params into program arguments:

# dvc.yaml
stages: 
  process_satellite:
    cmd: python process_satellite.py --date ${date}

In addition, you can use the --set-param option of dvc exp run to modify parameters at runtime:

dvc exp run -S date='17-02-2022'

The snippet above is run from the CLI, so other logic could set an environment variable (e.g. $DATE) instead of hardcoding the value:

dvc exp run -S date=$DATE
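
For a headless setup, the same call could also be assembled from a small Python launcher. This is just a sketch: build_exp_command and resolve_date are hypothetical helpers, and falling back to yesterday when DATE is unset is an assumption made for illustration.

```python
from datetime import date, datetime, timedelta


def build_exp_command(run_date):
    """Return the argument list for `dvc exp run -S date=<DD-MM-YYYY>`."""
    return ["dvc", "exp", "run", "-S", f"date={run_date.strftime('%d-%m-%Y')}"]


def resolve_date(environ, today):
    """Use the DATE variable if it is set, otherwise fall back to yesterday (assumed default)."""
    raw = environ.get("DATE")
    if raw:
        # Same DD-MM-YYYY format as in params.yaml above.
        return datetime.strptime(raw, "%d-%m-%Y").date()
    return today - timedelta(days=1)
```

A launcher would then run something like subprocess.run(build_exp_command(resolve_date(os.environ, date.today())), check=True).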

Very helpful, thanks! Especially for pointing me to the on-the-fly parameter changes.

I think I found a problem with this approach. Take your dvc.yaml, but add another stage (say postprocess_satellite) that depends on some output from the process_satellite stage.

  1. Run dvc exp run -S date='17-02-2022' && dvc repro → Both stages will run initially
  2. Run dvc exp run -S date='18-02-2022' && dvc repro → Both stages will run for the new date
  3. Run dvc exp run -S date='17-02-2022' && dvc repro process_satellite → First stage will run again, but shouldn’t.

In step 3, the first stage should be skipped, because it has already run before with the same parameter input. However, it runs again, and the reason given is: 'cmd' of stage: 'preprocess' has changed.

dvc exp run is meant to be a complete replacement for dvc repro, and it is intended for generating experiments which are completely independent of each other (see: https://dvc.org/doc/start/experiments)