Best practice for Python package dependencies?

Hi all,

I’m figuring out how to structure my project and found your tutorials a very helpful introduction.

In them, however, you only rely on a single script for each stage, for example:

dvc run -d code/ -d data/Posts.xml -o data/Posts.tsv \
          -f prepare.dvc \
          python code/ data/Posts.xml data/Posts.tsv


Now my understanding is the following:
If the stage's script imports some of my other libraries and I make changes to them, these changes would not be picked up by DVC, since those libraries are not an explicit dependency of the stage, correct?

What’s the best practice here for bigger projects, where each DVC stage is not just contained in a single script?

Our use case will be the following: for each stage we'll have a Jupyter notebook, which will import some of our Python packages. I'm assuming I should create a stage like this:

dvc run -d my_notebook.ipynb -d code/ -d data/Posts.xml -o data/Posts.tsv \
  -f prepare.dvc \
  papermill my_notebook.ipynb my_notebook_out.ipynb

Is this a good approach? Are there other ways? How are people with bigger projects dealing with this issue?

Thanks in advance,

Hi @rabefabi!

The reasonable way is to put the code that is very specific to a stage into a separate directory:

dvc run -d train ... train/ ...

In this case, also use .dvcignore to exclude __pycache__, so that regenerated bytecode files do not make the directory look changed.

I don't know of an easy way for DVC to analyze the actual dependency graph and then build and maintain it. It can probably be done with some custom scripts, though.
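As a rough idea of what such a custom script could look like, here is a minimal sketch that walks the imports of a stage script with Python's standard `ast` module and prints them as `-d` flags. This is not a DVC feature: the helper names (`resolve`, `local_imports`, `dep_flags`) are made up for illustration, and the sketch assumes all project code is importable from the project root. It only follows absolute imports in `.py` files, so relative imports, dynamic imports, and notebooks would need extra handling.

```python
import ast
from pathlib import Path


def resolve(module, root):
    """Map a dotted module name to a file under root, or None if it is not local."""
    base = root.joinpath(*module.split("."))
    for candidate in (base.with_suffix(".py"), base / "__init__.py"):
        if candidate.is_file():
            return candidate
    return None


def local_imports(script, root):
    """Return the project-local .py files that script imports, transitively."""
    root = Path(root).resolve()
    start = Path(script).resolve()
    seen = set()
    stack = [start]
    while stack:
        path = stack.pop()
        if path in seen:
            continue
        seen.add(path)
        tree = ast.parse(path.read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                # "from pkg import util" may refer to the submodule pkg/util.py
                names = [node.module]
                names += [f"{node.module}.{alias.name}" for alias in node.names]
            else:
                continue
            for name in names:
                target = resolve(name, root)
                if target is not None:
                    stack.append(target)
    seen.discard(start)
    return seen


def dep_flags(script, root="."):
    """Format the local imports of script as '-d path' arguments for a stage."""
    root_path = Path(root).resolve()
    deps = sorted(p.relative_to(root_path).as_posix() for p in local_imports(script, root))
    return " ".join(f"-d {dep}" for dep in deps)
```

The string returned by `dep_flags` could be pasted into the stage command next to the data dependencies. Third-party and standard-library imports are skipped because they do not resolve to files under the project root.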


Putting the notebooks in each stage's subdirectory should work for us. Thanks for the hint!
