Hey everyone,
How do you avoid having to define the paths to dependencies/outputs both in the dvc.yaml and in your code?
In the dvc.yaml I have to specify the paths to dependencies and outputs. My code then reads from and writes to these paths, so I have to define them (again) somewhere in the code.
Is there already a concept in place that lets me read the paths directly from the yaml, to avoid defining them twice?
Small example:
The dvc.yaml
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - data/raw
      - src/preprocessing
    outs:
      - data/preprocessed
The src/preprocess.py
from pathlib import Path

raw_path = Path("data/raw")
preprocessed_path = Path("data/preprocessed")
preprocess(raw_path, preprocessed_path)
...
Obviously, in this case, I could use command line arguments to never spell out the paths inside the script and have my dvc.yaml looking like this:
stages:
  preprocess:
    cmd: python src/preprocess.py data/raw data/preprocessed
    deps:
      - data/raw
      - src/preprocessing
    outs:
      - data/preprocessed
But this still “duplicates” the path (only inside the yaml), and isn’t always easily feasible (e.g., Jupyter Notebooks).
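For reference, the script side of the command-line variant could look roughly like the sketch below. It assumes the two positional arguments arrive in the same order as in the cmd line of the dvc.yaml; parse_args, main and the body of preprocess are placeholder names for illustration, not anything DVC provides.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    # Positional arguments mirror the order used in the dvc.yaml cmd line.
    parser = argparse.ArgumentParser(description="Preprocess raw data.")
    parser.add_argument("raw_path", type=Path, help="directory with the raw input data")
    parser.add_argument("preprocessed_path", type=Path, help="directory for the output")
    return parser.parse_args(argv)


def preprocess(raw_path, preprocessed_path):
    # Placeholder for the real preprocessing logic (hypothetical).
    print(f"preprocessing {raw_path} -> {preprocessed_path}")


def main(argv=None):
    # argv=None makes argparse fall back to sys.argv[1:].
    args = parse_args(argv)
    preprocess(args.raw_path, args.preprocessed_path)
    return args
```

In src/preprocess.py, main() would be called under the usual if __name__ == "__main__": guard, so the script itself no longer contains any path.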
I’m searching for something along the lines of “named dependencies and outputs”:
stages:
  preprocess:
    cmd: python src/preprocess.py data/raw data/preprocessed
    deps:
      my_raw_path: data/raw
      other:
        - src/preprocessing
    outs:
      my_preprocessed_path: data/preprocessed
Similar to the params.yaml, I could then read the dvc.yaml in my code and look up the path in the resulting dict:
raw_path = Path(dvc_yaml["stages"]["preprocess"]["deps"]["my_raw_path"])
...
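To make the idea concrete, here is a minimal sketch of the lookup. The named deps/outs layout is only the proposal from this post, not a format I know DVC to accept. load_dvc_yaml and named_path are made-up helper names, and loading the file assumes PyYAML is installed.

```python
from pathlib import Path


def load_dvc_yaml(path="dvc.yaml"):
    # Assumes PyYAML is installed; imported lazily so that the lookup
    # helper below also works on any already-parsed dict.
    import yaml

    with open(path) as f:
        return yaml.safe_load(f)


def named_path(dvc_yaml, stage, kind, name):
    # kind is "deps" or "outs"; name is the key from the proposed layout.
    try:
        return Path(dvc_yaml["stages"][stage][kind][name])
    except KeyError as err:
        raise KeyError(f"{stage}.{kind}.{name} not found in dvc.yaml") from err


# Dict equivalent of the hypothetical dvc.yaml above, as a YAML parser
# would return it.
dvc_yaml = {
    "stages": {
        "preprocess": {
            "cmd": "python src/preprocess.py data/raw data/preprocessed",
            "deps": {"my_raw_path": "data/raw", "other": ["src/preprocessing"]},
            "outs": {"my_preprocessed_path": "data/preprocessed"},
        }
    }
}

raw_path = named_path(dvc_yaml, "preprocess", "deps", "my_raw_path")
preprocessed_path = named_path(dvc_yaml, "preprocess", "outs", "my_preprocessed_path")
```

In a real script the dict literal would be replaced by dvc_yaml = load_dvc_yaml(), which means the script has to be run from the directory that contains the dvc.yaml.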
I’m not convinced this is the smartest solution, so I was wondering how everyone else handles this.
Cheers,
Fabian