Adjusting absolute paths in stage command w/o triggering re-run

Jonas · December 12, 2021, 1:07am

Hello,

first of all, thanks for this awesome project! We are very likely to use it as the backbone of our data versioning infrastructure.

One question came up in the process. I am building a template for pipelines that combine various existing software components or scripts. These scripts are called by pipeline stages (via cmd) and take various command line arguments. These specify the paths to input and output files ingested/created by the scripts. So far so normal.

The issue is that our scripts need absolute paths. They have no way of resolving relative paths and more importantly the relation of the script path to the data path may not always be the same when the repo is used.

For instance, say some input data is stored in my_dvc_repo/data/input.csv. Then the command for the stage may look like python path/to/script/my_script.py --input my_dvc_repo/data/input.csv. Another user might clone the repo so that the file ends up at repos/my_dvc_repo/data/input.csv. The command for the stage must then include this different path: python path/to/script/my_script.py --input repos/my_dvc_repo/data/input.csv.

We can deal with that of course. A shell script could modify dvc.yaml after each clone and adjust the cmd string to include the correct absolute paths.

The blocker seems to be that the stage’s cmd is tracked in dvc.lock. Modifying it appears to count as a change of the stage, meaning it will have to be rerun even though all we changed was the absolute location of the input/output files, not their content and not their relation to the rest of the DVC files.

I have come up with various ways to solve this as well, but they all seem quite hacky*.

So I would just like to sanity check whether I am missing some obvious way already available in DVC. For instance I know about dvc root but don’t see how it can be employed in the cmd multiple times without making the resulting command string overly complex (thinking about combining realpath with dvc root each time the absolute repo path is needed).

Appreciating any help! Thanks a lot!

Best regards

Jonas

(*) My current plan is to write a wrapper script which, by convention, needs to be called first in each cmd. It would essentially get the rest of the command as string arguments. Paths could use a placeholder for the absolute path (e.g. ROOT), and the wrapper would then replace the placeholder and call the actual script. E.g. wrapper python my_script --input ROOT/data/input.csv. This way the cmd itself would remain unchanged regardless of the absolute location of the repo. Another thought was to use vars in dvc.yaml and adjust these to the absolute paths programmatically upon cloning, but I do not think these can be used within the cmd string.

Paffciu · December 12, 2021, 5:44pm

Hello @Jonas, thank you for the kind words!
Regarding the vars:
they can be used in cmd. Take a look at https://dvc.org/doc/user-guide/project-structure/pipelines-files#templating, where there are examples, including one with vars in cmd. But changing them will also be considered a change in the pipeline, so it probably won’t solve your original problem.

Regarding your main question: I am afraid that using one of the “hacky” approaches might be necessary. I don’t think there is a way to specify the absolute path in the cmd. In your case, I would probably try to create some kind of “invoke” script that takes care of calling the particular command and converting relative paths to absolute ones.
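A minimal sketch of what such an “invoke” helper could look like. It is not an existing DVC feature. The names to_abs and invoke are hypothetical, and it assumes that only the values following --input and --output are paths (adjust the flag names to your scripts). The root is passed in explicitly so the sketch runs without DVC; in a stage it would come from something like the output of dvc root made absolute.

```shell
# Resolve a repo-relative path ($2) against a root directory ($1).
to_abs() {
  case "$2" in
    /*) printf '%s\n' "$2" ;;              # already absolute: leave as is
    *)  printf '%s/%s\n' "${1%/}" "$2" ;;  # prefix with the root
  esac
}

# invoke ROOT CMD [ARGS...]
# Rewrites the value after --input/--output (assumed flag names) to an
# absolute path, then runs the command. The function body is a subshell,
# so its variables do not leak into the caller.
invoke() (
  root="$1"; shift
  n=$#
  prev=""
  # Rotate through the positional parameters once, appending each
  # (possibly rewritten) argument to the end of the list.
  while [ "$n" -gt 0 ]; do
    arg="$1"; shift
    if [ "$prev" = "--input" ] || [ "$prev" = "--output" ]; then
      new="$(to_abs "$root" "$arg")"
    else
      new="$arg"
    fi
    set -- "$@" "$new"
    prev="$arg"
    n=$((n - 1))
  done
  "$@"
)
```

With this, the cmd in dvc.yaml keeps its relative paths (and therefore never changes between clones), while the called script only ever sees absolute ones. Because the arguments are passed on as separate words rather than re-parsed, no eval is needed.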

Jonas · December 12, 2021, 6:20pm

Hi Paweł,

Thanks a lot for your response! Right, not sure how I could miss that the vars can be used in cmd as well. But, yes, if that’s considered a change as well it is not a solution.

I have now in fact gone the route of having a wrapper script that replaces the string ROOT with the absolute DVC root path and is simply invoked first in each cmd, getting the rest of the cmd as argument. To be honest I shied away from detecting all paths in the cmd and replacing relative ones with absolutes, so I went with this placeholder approach.

Thanks a lot! It helps to know that I am not missing something fundamental.

Here’s the script in case someone has a similar issue (note that I’m running in a Poetry environment, hence the poetry run; it could be omitted otherwise).

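#!/bin/bash
# pathrep: replace the ROOT placeholder with the absolute DVC root, then run the command.
ARGS_STR="$*"
# dvc root prints a path relative to the current directory; readlink -f makes it absolute.
DVC_ROOT="$(readlink -f "$(poetry run dvc root)")"
CMD="${ARGS_STR//ROOT/$DVC_ROOT}"
eval "$CMD"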

A cmd could then look like ./pathrep "python src/my_script.py --input ROOT/data/input.csv". The quotes around the rest of the command are needed only if it contains characters that the shell would otherwise interpret before pathrep sees them (such as a semicolon).
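To illustrate why the cmd string stays identical across clones, here is a small sketch of the placeholder substitution on its own. The function name replace_root and the two example roots are made up for illustration, and the root is passed as an argument so the snippet runs without DVC or Poetry. It uses a portable loop instead of the bash-only ${var//ROOT/...} expansion from the script above.

```shell
# replace_root CMD_STRING ABSOLUTE_ROOT
# Prints CMD_STRING with every occurrence of the literal ROOT placeholder
# replaced by ABSOLUTE_ROOT. The body is a subshell to keep variables local.
replace_root() (
  rest="$1"
  out=""
  while :; do
    case "$rest" in
      *ROOT*)
        out="$out${rest%%ROOT*}$2"   # text before the first ROOT, then the root
        rest="${rest#*ROOT}"         # continue after that ROOT
        ;;
      *) break ;;
    esac
  done
  printf '%s\n' "$out$rest"
)

# The same cmd string, as it would be stored in dvc.yaml and dvc.lock:
cmd='python src/my_script.py --input ROOT/data/input.csv'

# Two hypothetical clone locations yield two different effective commands.
replace_root "$cmd" /home/alice/my_dvc_repo
replace_root "$cmd" /srv/repos/my_dvc_repo
```

Only the expanded command differs between the two clones; the string DVC records for the stage is the same in both, so the stage is not considered changed.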

Best

Jonas

skshetry · December 13, 2021, 8:12am

@Jonas, we don’t have a dedicated way to resync dvc.yaml and dvc.lock, but you can use dvc commit, which will do the right thing for you in most cases: it will commit the changes in your workspace and also resync the dvc.yaml and dvc.lock files.

Jonas · December 13, 2021, 12:30pm

@skshetry, thank you! But that won’t keep stages with a changed cmd from being re-run, correct?
