Hi - I’m fairly new to DVC and trying to apply it to a non-ML workflow that uses a config YAML file with file paths into a tracked data directory to build a model. I then have various stages within a pipeline to do optimization, etc.
The problem I’m encountering is that if I change the contents of a single file within the tracked directory, it triggers the build stage, which is cumbersome and requires a lot of data processing. I was hoping that with an inventory of changed or modified files I would only have to do a partial build (since I also internally track dependencies between build components).
Is there a way to get a dvc diff from within a pipeline? I know I could just run dvc diff before the run and depend on its output file, but I thought there might be another way?
It sounds like you might want to make the pipeline more granular so the stages can depend on individual files instead of the entire data directory. Or you may need to make your data processing script diff the repo itself to work out the incremental logic. It’s hard to imagine what’s going on without seeing the pipeline and how it works though. Are you able to share your repo?
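For illustration, here is a rough sketch of what "more granular" could look like, assuming a hypothetical layout where data/ holds part_a.csv and part_b.csv and a build.py script can process one input file at a time (all names are made up):

```bash
# Create one stage per input file (appended to dvc.yaml), so that editing
# data/part_a.csv only invalidates build_part_a instead of the whole build.
for part in part_a part_b; do
  dvc stage add --force -n "build_${part}" \
    -d "data/${part}.csv" -d build.py \
    -o "build/${part}.pkl" \
    "python build.py data/${part}.csv build/${part}.pkl"
done

# Only stages whose own dependencies changed will actually run.
dvc repro
```

You could also write these stages into dvc.yaml by hand; the point is just that each stage depends only on the files it actually consumes.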
Sometimes, if you make changes to a dependency that you definitely know will not (in this particular circumstance) affect the outcome of your stage, you can use dvc commit to update the dvc.lock file.
For example, I often declare code as a dependency for a stage. However, if I only change comments or fix linting issues, I do not need to rerun the stage. In this case dvc commit is useful.
My workflow is often to use dvc status to diagnose the changes and then decide whether I can use dvc commit or whether I need to rerun.
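As a concrete sketch of that workflow (stage name build_part_a is hypothetical, reusing the made-up names from the earlier example):

```bash
# Show which dependencies/outputs DVC considers changed.
dvc status

# If you're certain the change (e.g. comments only in build.py) cannot affect
# the outputs, record the new hashes in dvc.lock without rerunning the stage.
dvc commit build_part_a

# Otherwise, rerun only what is actually out of date.
dvc repro
```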
But I concur with @petebachant’s advice to work toward more granular stages.