Code dependency improvements

I have been using DVC for over a year on a project that has grown to include multiple inter-dependent pipelines and many experiments. One of the biggest inconveniences I’ve faced is unnecessary pipeline repros. I think the issue comes down to DVC taking a very general, high-level view of dependencies: it just hashes the data and code files in their entirety. If only comments change, DVC thinks the dependency has changed even though, functionally, it hasn’t.

The current way to deal with this is to run `dvc commit` if you know the change didn’t impact a particular pipeline. However, this becomes pretty inconvenient when you have multiple pipelines that all depend on the same file. You could try to organize your code so that each file only impacts one pipeline or stage, but that is also not ideal, because people generally want to group common functionality together. For example, you might keep all of your “model”-related code in one file, with functions such as `create_model`, `load_model`, `save_model`, `get_evaluation_model`, etc. It makes sense to group these functions together because they are likely to share the same set of utility functions, like the ones that compute model file paths.

On the other hand, `create_model` is probably used in the training stage but not in any other stage, while `load_model` is probably used in an evaluation stage but not in a training stage. So if I make a change to `load_model`, DVC will want to reproduce all of my training stages, and now I have to go through all of the training stages in all of my pipelines one by one and commit them. This feels like an inconvenient and error-prone process.
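To make the situation concrete, here is a minimal `dvc.yaml` sketch (the stage names, paths, and commands are made up for illustration). Both stages list `src/models.py` as a dependency, so a comment edit or a change to `load_model` invalidates `train` even though `train` never calls it:

```yaml
stages:
  train:
    cmd: python src/train.py        # calls create_model, save_model
    deps:
      - data/train.csv
      - src/train.py
      - src/models.py               # whole file is the dependency
    outs:
      - models/model.pkl
  evaluate:
    cmd: python src/evaluate.py     # calls load_model
    deps:
      - models/model.pkl
      - src/evaluate.py
      - src/models.py               # same whole-file dependency
    outs:
      - metrics/eval.json
```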

I think it would be nice if there were plugins to support closer integration with specific languages; Python is probably what most people are using. This feature would make code dependencies more granular by specifying them at, e.g., the function or class level. A Python parser would be able to distinguish changes made to comments and unrelated functions from real changes. Of course, it would be pretty inconvenient to manually specify every function used in every stage, so it would be cool if there were some sort of lightweight execution-tracing mechanism that keeps track of which functions are called during a stage; the function dependencies could then be determined automatically. On a subsequent `repro`, those specific functions could be checked for changes rather than the whole files. If none of the traced functions have changed and the imports haven’t changed, then the stage outputs shouldn’t change either.
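To sketch what I mean (purely hypothetical, nothing here is an existing DVC API): the Python standard library already has the building blocks. `sys.setprofile` can record which functions run during a stage, and hashing a function’s AST instead of its raw text makes comment-only and formatting-only edits hash-stable:

```python
import ast
import hashlib
import sys


def ast_hash(source: str) -> str:
    """Hash Python source via its AST: comment-only and formatting-only
    edits produce the same hash, real code changes produce a new one."""
    return hashlib.sha256(ast.dump(ast.parse(source)).encode()).hexdigest()


def trace_stage(stage_fn, *args, **kwargs):
    """Run a stage while recording every pure-Python function it calls."""
    called = set()

    def profiler(frame, event, arg):
        if event == "call":  # pure-Python calls; C calls arrive as "c_call"
            called.add((frame.f_code.co_filename, frame.f_code.co_name))

    sys.setprofile(profiler)
    try:
        result = stage_fn(*args, **kwargs)
    finally:
        sys.setprofile(None)
    return result, called
```

A hypothetical `repro` would then re-hash only the recorded functions (extracted from their source files with `ast`) and skip the stage if none of them changed. Whether the profiler overhead is acceptable for long-running stages is exactly the kind of question I haven’t answered.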

I haven’t thought through all the details, of course. What about other languages? What about when Python calls into compiled libraries? Would the execution tracer add significant overhead? Maybe this is even the wrong approach to the problem. My overall message is that as a project grows, it becomes more and more inconvenient to deal with accidental dependency changes, and the root cause of those accidental changes seems to be that DVC has an overly coarse view of code files.

Those are very good points; we can all relate and have had similar ideas. We’ve recently started introducing somewhat related integrations for databases (https://github.com/iterative/dvc/pull/10040), but other than that, unfortunately, there are no plans to work on more flexible pipeline APIs any time soon.