Hi - I’m fairly new to DVC and trying to apply it to a non-ML workflow that uses a config YAML file with file paths into a tracked data directory to build a model. I then have various stages within a pipeline to do optimization, etc.
The problem I’m encountering is that if I change the contents of a single file within the tracked directory, it triggers the build stage, which is cumbersome and requires a lot of data processing. I was hoping that with an inventory of changed or modified files I would only have to do a partial build (since I also internally track dependencies between build components).
Is there a way to get a dvc diff from within a pipeline? I know I could just run dvc diff before the run and depend on its output file, but I thought there might be another way?
It sounds like you might want to make the pipeline more granular so the stages can depend on individual files instead of the entire data directory. Or you may need to make your data processing script diff the repo itself to work out the incremental logic. It’s hard to imagine what’s going on without seeing the pipeline and how it works though. Are you able to share your repo?
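For illustration, here is a rough sketch of what "more granular" could look like, assuming a hypothetical layout where data/ holds part_a.csv and part_b.csv and a build.py script can process one input file at a time (all names are made up):

```bash
# Create one stage per input file (appended to dvc.yaml), so that editing
# data/part_a.csv only invalidates build_part_a instead of the whole build.
for part in part_a part_b; do
  dvc stage add --force -n "build_${part}" \
    -d "data/${part}.csv" -d build.py \
    -o "build/${part}.pkl" \
    "python build.py data/${part}.csv build/${part}.pkl"
done

# Only stages whose own dependencies changed will actually run.
dvc repro
```

You could also write these stages into dvc.yaml by hand; the point is just that each stage depends only on the files it actually consumes.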
Sometimes, if you make changes to a dependency that you definitely know will not (in this particular circumstance) affect the outcome of your stage, you can use dvc commit to update the dvc.lock file.
For example, I often declare code as a dependency for a stage. However, if I only change comments or fix linting issues, I do not need to rerun the stage. In this case dvc commit is useful.
My workflow is often to use dvc status to diagnose the changes and then decide whether I can use dvc commit or whether I need to rerun.
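As a concrete sketch of that workflow (stage name build_part_a is hypothetical, reusing the made-up names from the earlier example):

```bash
# Show which dependencies/outputs DVC considers changed.
dvc status

# If you're certain the change (e.g. comments only in build.py) cannot affect
# the outputs, record the new hashes in dvc.lock without rerunning the stage.
dvc commit build_part_a

# Otherwise, rerun only what is actually out of date.
dvc repro
```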
But I concur with @petebachant’s advice to work toward more granular stages.