Hello,
first of all, thanks for this awesome project! We are very likely to use it as the backbone of our data versioning infrastructure.
One question came up in the process. I am building a template for pipelines that combine various existing software components or scripts. These scripts are called by pipeline stages (via cmd) and take various command line arguments. These specify the paths to input and output files ingested/created by the scripts. So far so normal.
The issue is that our scripts need absolute paths. They have no way of resolving relative paths and, more importantly, the location of the scripts relative to the data may not always be the same when the repo is used.
For instance, say some input data is stored in my_dvc_repo/data/input.csv. Then the command for the stage may look like python path/to/script/my_script.py --input my_dvc_repo/data/input.csv. Another user might clone to repos/my_dvc_repo/data/input.csv. Then the command for the stage must include this different path: python path/to/script/my_script.py --input repos/my_dvc_repo/data/input.csv.
We can deal with that of course. A shell script could modify dvc.yaml after each clone and adjust the cmd string to include the correct absolute paths.
The blocker seems to be that the stage's cmd is tracked in dvc.lock. Modifying it appears to count as a change of the stage, meaning it will have to be rerun even though all we changed was the absolute location of the input/output files, not their content and not their relation to the rest of the DVC files.
I have come up with various ways to solve this as well, but they appear quite hacky*.
So I would just like to sanity check whether I am missing some obvious way already available in DVC. For instance, I know about dvc root but don't see how it can be employed in the cmd multiple times without making the resulting command string overly complex (I am thinking of combining realpath with dvc root each time the absolute repo path is needed).
I would appreciate any help! Thanks a lot!
Best regards
Jonas
(*) My current plan is to write a wrapper script which, by convention, needs to be called first in each cmd. It would essentially get the rest of the command as string arguments. Paths could use a placeholder for the absolute path (e.g. ROOT) and the wrapper can then replace it and call the actual script. E.g. wrapper python my_script --input ROOT/data/input.csv. This way the cmd itself would remain unchanged regardless of the absolute location of the repo. Another thought was to use vars in dvc.yaml and adjust these to the absolute paths programmatically upon cloning, but I do not think these can be used within the cmd string.
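
To make the wrapper idea more concrete, here is a rough sketch of what I have in mind. The file name wrapper.py, the function names and the exact placeholder handling are all just my assumptions. It also assumes the repo root can be recognised by the .dvc directory that dvc init creates; calling dvc root from the wrapper would be an alternative.

```python
#!/usr/bin/env python3
"""Hypothetical wrapper.py: swap a ROOT placeholder for the absolute repo path,
then run the remaining arguments as the actual command."""
import os
import subprocess
import sys

PLACEHOLDER = "ROOT"  # assumed placeholder token, purely a convention


def find_repo_root(start):
    """Walk upwards from `start` until a directory containing `.dvc` is found."""
    current = os.path.abspath(start)
    while True:
        if os.path.isdir(os.path.join(current, ".dvc")):
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root without a match
            raise RuntimeError("no .dvc directory found above " + start)
        current = parent


def expand_args(args, root):
    """Replace ROOT (alone or as the leading path component) with `root`."""
    expanded = []
    for arg in args:
        if arg == PLACEHOLDER:
            expanded.append(root)
        elif arg.startswith(PLACEHOLDER + "/"):
            expanded.append(os.path.join(root, arg[len(PLACEHOLDER) + 1:]))
        else:
            expanded.append(arg)  # everything else is passed through untouched
    return expanded


def main(argv):
    """Resolve the repo root, expand placeholders and run the real command."""
    root = find_repo_root(os.getcwd())
    return subprocess.call(expand_args(argv, root))


# Only act when a command was actually passed to the wrapper.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

The cmd in dvc.yaml would then read something like python wrapper.py python my_script --input ROOT/data/input.csv, so the string tracked in dvc.lock stays identical no matter where the repo is cloned. Only arguments that are exactly ROOT or start with ROOT/ are rewritten, so something like --input=ROOT/... would need extra handling.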