First of all, thanks for this awesome project! We are very likely to use it as the backbone of our data versioning infrastructure.
One question came up in the process. I am building a template for pipelines that combine various existing software components or scripts. These scripts are called by pipeline stages (via cmd) and take various command line arguments, which specify the paths to the input and output files the scripts read and create. So far, so normal.
The issue is that our scripts need absolute paths. They have no way of resolving relative paths and, more importantly, the location of the script relative to the data may not always be the same when the repo is used.
For instance, say some input data is stored in my_dvc_repo/data/input.csv. Then the command for the stage may look like
python path/to/script/my_script.py --input my_dvc_repo/data/input.csv
Another user might clone the repo so that the same file ends up at repos/my_dvc_repo/data/input.csv. Then the command for the stage must include this different path:
python path/to/script/my_script.py --input repos/my_dvc_repo/data/input.csv
We can deal with that, of course. A shell script could modify dvc.yaml after each clone and adjust the cmd string to include the correct absolute paths.
The blocker seems to be that the stage’s cmd is tracked in dvc.lock. Modifying it appears to count as a change to the stage, meaning the stage has to be rerun even though all we changed was the absolute location of the input/output files, not their content and not their location relative to the rest of the DVC files.
I have come up with various ways to solve this as well, but they seem quite hacky*.
So I would just like to sanity-check whether I am missing some obvious way already available in DVC. For instance, I know about dvc root, but I don’t see how it can be used in the cmd multiple times without making the resulting command string overly complex (I am thinking of inserting dvc root each time the absolute repo path is needed).
I would appreciate any help! Thanks a lot!
(*) My current plan is to write a wrapper script which, by convention, must be called first in each cmd. It would essentially receive the rest of the command as string arguments. Paths could use a placeholder for the absolute path (e.g. ROOT), and the wrapper would then replace the placeholder and call the actual script, e.g.
wrapper python my_script --input ROOT/data/input.csv
This way the cmd itself would remain unchanged regardless of the absolute location of the repo. Another thought was to use
dvc.yaml and adjust these to the absolute paths programmatically upon cloning, but I do not think these can be used within the
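
To make the wrapper idea from the footnote concrete, here is a rough Python sketch of what such a script might look like. It is only an illustration under assumptions, not an established DVC mechanism: the names PLACEHOLDER, find_repo_root and expand_args are made up for this example, and the repo root is located by walking up from the current directory until a .dvc directory is found (instead of calling dvc root).

```python
import subprocess
import sys
from pathlib import Path

# Hypothetical placeholder convention taken from the post: "ROOT" stands for
# the absolute path of the repo root.
PLACEHOLDER = "ROOT"


def find_repo_root(start: Path) -> Path:
    """Walk upwards from `start` until a directory containing `.dvc` is found.

    Assumption: the repo root is the closest ancestor with a `.dvc` directory.
    """
    for candidate in [start, *start.parents]:
        if (candidate / ".dvc").is_dir():
            return candidate
    raise FileNotFoundError(f"no .dvc directory found above {start}")


def expand_args(args: list[str], root: Path) -> list[str]:
    """Replace a leading `ROOT/` (or a bare `ROOT`) in each argument."""
    prefix = PLACEHOLDER + "/"
    expanded = []
    for arg in args:
        if arg == PLACEHOLDER:
            expanded.append(str(root))
        elif arg.startswith(prefix):
            expanded.append(str(root / arg[len(prefix):]))
        else:
            expanded.append(arg)
    return expanded


def main(argv: list[str]) -> int:
    """Expand placeholders in `argv` and run the result as a command."""
    root = find_repo_root(Path.cwd().resolve())
    # Run the real command with absolute paths and pass its exit code through.
    return subprocess.call(expand_args(argv, root))


# Only act when a command was actually passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

If this were saved as wrapper.py (a hypothetical file name), the stage cmd could stay as something like python wrapper.py python my_script --input ROOT/data/input.csv on every machine, so the string recorded in dvc.lock would not depend on where the repo was cloned. Only arguments that are exactly ROOT or start with ROOT/ are rewritten, so unrelated arguments that merely contain those letters are left alone.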