Hi! What is the recommended way to deal with optional params for scripts in dvc.yaml?
Let’s suppose we have a script that can be run either as python train.py or as python train.py --resume path-to-model-weights.
@agushin
I think we have not had this use case before, and your approach seems to be valid. @skshetry might have more information than me on that matter.
Thank you for the answer! I hope @skshetry could provide more information about this case.
Also, I’m not sure what the right way is to handle situations where a script should be called like this (note the arbitrary number of values supplied to --numbers):
python calculator.py --numbers 1 2 3 4 --operation sum
# or
python calculator.py --numbers 1 2 --operation multiply
# or
python calculator.py --operation sum --numbers 1 2 3 4 5 6
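On the script side, the three invocations above can be accepted by one argparse definition. The following is only a minimal sketch of what a calculator.py like this might look like; the function names (build_parser, calculate, main) are hypothetical and not taken from the original script.

```python
import argparse
import math


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI for a hypothetical calculator.py."""
    parser = argparse.ArgumentParser(description="Hypothetical calculator.py")
    # nargs="+" collects one or more values after --numbers into a list,
    # so the flag accepts an arbitrary number of values.
    parser.add_argument("--numbers", type=float, nargs="+", required=True)
    parser.add_argument("--operation", choices=["sum", "multiply"], required=True)
    return parser


def calculate(numbers, operation):
    """Apply the requested operation to the list of numbers."""
    if operation == "sum":
        return sum(numbers)
    return math.prod(numbers)  # requires Python 3.8+


def main(argv=None):
    # argv=None makes argparse read sys.argv[1:], as a real script would.
    args = build_parser().parse_args(argv)
    result = calculate(args.numbers, args.operation)
    print(result)
    return result

# In a real script, finish with:
#     if __name__ == "__main__":
#         main()
```

Because both flags are named, their order on the command line does not matter, which covers the last invocation above as well.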
If I had a constant number of arguments, then I would do something like this, which is already not very beautiful:
This seems to me like a case where it would be better to just support reading values directly from params.yaml in your Python script, instead of always passing them via the command line.
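As a rough sketch of that suggestion, the script could load params.yaml itself and pick out its own section. Everything specific here is an assumption made for illustration: the calculator section name, its numbers and operation keys, and the helper names are hypothetical, and PyYAML is a third-party dependency.

```python
def load_params(path="params.yaml"):
    """Load the whole params file into a dict."""
    # PyYAML is third-party (pip install pyyaml); it is imported lazily so the
    # rest of this sketch can be used without it.
    import yaml

    with open(path) as f:
        return yaml.safe_load(f) or {}


def calculator_settings(params):
    """Extract (numbers, operation) from a hypothetical 'calculator' section.

    Assumed params.yaml layout:
        calculator:
          numbers: [1, 2, 3, 4]
          operation: sum
    """
    section = params.get("calculator", {})
    # A YAML list can have any length, so no per-value CLI plumbing is needed.
    numbers = [float(n) for n in section.get("numbers", [])]
    operation = section.get("operation", "sum")
    if operation not in ("sum", "multiply"):
        raise ValueError(f"unsupported operation: {operation!r}")
    return numbers, operation
```

With this layout the stage command stays a plain python calculator.py, and the calculator keys can be listed under the stage's params in dvc.yaml so that DVC still tracks changes to them.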
I’d also be interested in how to handle command-line program arguments in DVC. The example provided by @aguschin is a good one. For some use cases, it just doesn’t make a lot of sense to define a parameter in YAML rather than pass it as a program argument, particularly for mandatory “runtime” parameters that are not configuration options in that sense.
Another example would be a pipeline that processes and analyzes satellite image data for certain points in time. One of the early pipeline steps would be to download that data from some online repository, so I’d like to have a param like --date that I can just pass without first having to edit params.yaml, perhaps because I’m running the pipeline headlessly on some server.
Does DVC support such use cases?
Edit: Moreover, from my understanding, the params file is meant to be checked in to version control, which is another hint that it is the wrong place for “runtime” arguments like the date one above, which have no default value.
Hi @ferdi, even though there might be some limitations depending on your use case, having parameters tracked by DVC should not be incompatible with the use cases you described.
Let’s say you have:
# params.yaml
date: 16-02-2022
You can use Templating and DVC will interpolate those params into program arguments:
I think I found a problem with this approach. Take your dvc.yaml, but add another stage (say postprocess_satellite) that depends on some output from the process_satellite stage.
1. Run dvc exp run -S date='17-02-2022' && dvc repro. Both stages will run initially.
2. Run dvc exp run -S date='18-02-2022' && dvc repro. Both stages will run for the new date.
3. Run dvc exp run -S date='17-02-2022' && dvc repro process_satellite. The first stage will run again, but shouldn’t.
In step 3, the first stage should be skipped, because it has already run before with the same parameter input. However, it runs again, and the reported reason is: 'cmd' of stage: 'preprocess' has changed.
dvc exp run is meant to be a complete replacement for dvc repro, and it is intended for generating experiments which are completely independent of each other (see: https://dvc.org/doc/start/experiments)