Dvc.yaml: `deps` & `outs` section accessible from code?

Hey everyone,

How do you avoid having to define the paths to dependencies/outputs both in the dvc.yaml and in your code?

In the dvc.yaml I have to specify paths to dependencies and outputs. In my code, I then read from and write to these paths, but have to define them (again) somewhere.
Is there already a concept in place for reading the paths directly from the yaml (to avoid defining them twice)?

Small example:

The dvc.yaml

stages:
  preprocess:
    cmd:  python src/preprocess.py
    deps:
    - data/raw
    - src/preprocessing
    outs:
    - data/preprocessed

The src/preprocess.py

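from pathlib import Path

# These paths repeat what is already declared under deps/outs in dvc.yaml.
raw_path = Path("data/raw")
preprocessed_path = Path("data/preprocessed")

preprocess(raw_path, preprocessed_path)
...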

Obviously, in this case, I could use command line arguments to never spell out the paths inside the script and have my dvc.yaml looking like this:

stages:
  preprocess:
    cmd:  python src/preprocess.py data/raw data/preprocessed
    deps:
    - data/raw
    - src/preprocessing
    outs:
    - data/preprocessed

But this still “duplicates” the path (only inside the yaml), and isn’t always easily feasible (e.g., Jupyter Notebooks).
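For reference, the script side of the command-line variant could look roughly like the sketch below. It assumes the two positional arguments from the `cmd` line above; `parse_args` and the body of `preprocess` are placeholders I made up for illustration, not anything DVC provides.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Read the input and output paths from the command line."""
    parser = argparse.ArgumentParser(description="Preprocess raw data.")
    parser.add_argument("raw_path", type=Path, help="directory with the raw data")
    parser.add_argument("preprocessed_path", type=Path, help="directory for the output")
    return parser.parse_args(argv)


def preprocess(raw_path, preprocessed_path):
    # Hypothetical placeholder: the real preprocessing logic would go here.
    preprocessed_path.mkdir(parents=True, exist_ok=True)


if __name__ == "__main__":
    args = parse_args()
    preprocess(args.raw_path, args.preprocessed_path)
```

With this, the paths live only in the dvc.yaml, although they still appear there twice (once in `cmd`, once in `deps`/`outs`).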

I’m searching for something along the lines of “named dependencies and outputs”:

stages:
  preprocess:
    cmd:  python src/preprocess.py data/raw data/preprocessed
    deps:
      my_raw_path: data/raw
      other:
      - src/preprocessing
    outs:
      my_preprocessed_path: data/preprocessed

Similar to the params.yaml I could then read the dvc.yaml in my code and look for the path in the dict:

raw_path = Path(dvc_yaml["stages"]["preprocess"]["deps"]["my_raw_path"])
...

I’m not convinced this solution is the smartest, and was wondering how everyone else handles this.

Cheers,
Fabian

Hm, after browsing the forum a bit more I feel silly. It seems a similar question was discussed just a couple of days ago: Parameterlike dependencies.

If I understand this question and the docs correctly, I can specify the paths in the params.yaml, and have the DVC Pipeline track the values of these parameters as deps/outs:

dvc.yaml:

stages:
  preprocess:
    cmd:  python src/preprocess.py
    deps:
      - ${raw_path}
      - src/preprocessing
    outs:
      - ${preprocessed_path}

params.yaml

raw_path: "data/raw"
preprocessed_path: "data/preprocessed"

I just tried it in a small project, and it is clearly a better solution than my proposed “named dependencies” from above.
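On the code side, the script can then read the same params.yaml, so the paths are defined in exactly one place. A minimal sketch, assuming PyYAML is installed and the two keys from the params.yaml above; `load_paths` is a helper name I made up, and `preprocess` stands in for the real logic:

```python
from pathlib import Path

import yaml  # PyYAML, a third-party package (assumed to be installed)


def load_paths(params_file="params.yaml"):
    """Return (raw_path, preprocessed_path) as defined in the params file."""
    with open(params_file) as f:
        params = yaml.safe_load(f)
    return Path(params["raw_path"]), Path(params["preprocessed_path"])


def preprocess(raw_path, preprocessed_path):
    # Hypothetical placeholder: the real preprocessing logic would go here.
    preprocessed_path.mkdir(parents=True, exist_ok=True)


if __name__ == "__main__":
    raw_path, preprocessed_path = load_paths()
    preprocess(raw_path, preprocessed_path)
```

Since the script opens params.yaml by a relative path, this assumes it is run from the directory containing that file, which is the case when the stage is defined in a dvc.yaml sitting next to it.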

Looking forward to the release!
