Best practice for Python package dependencies?

Hi all,

I’m figuring out how to structure my project and found your tutorials a very helpful introduction.

In them, however, you only rely on a single script for each stage, for example:

dvc run -d code/xml_to_tsv.py -d data/Posts.xml -o data/Posts.tsv \
          -f prepare.dvc \
          python code/xml_to_tsv.py data/Posts.xml data/Posts.tsv

Source

Now my understanding is the following:
If xml_to_tsv.py imports some of my other modules (e.g., my_special_xml_importer.py) and I make changes to my_special_xml_importer.py, DVC would not pick up these changes, since my_special_xml_importer.py is not an explicit dependency of the stage. Is that correct?

What’s the best practice here for bigger projects, where each DVC stage is not just contained in a single script?

Our use case is the following: for each stage we’ll have a Jupyter notebook, which imports some of our Python packages. I’m assuming I should create a stage like this:

dvc run -d my_notebook.ipynb -d code/my_lib.py -d data/Posts.xml -o data/Posts.tsv \
          -f prepare.dvc \
          papermill my_notebook.ipynb my_notebook_out.ipynb

Is this a good approach? Are there other ways? How do people with bigger projects deal with this issue?

Thanks in advance,
Fabi

Hi @rabefabi!

A reasonable approach is to put the code that is specific to a stage into a separate directory and declare that whole directory as a dependency:

dvc run -d train ... train/train.py ...

In this case, also use .dvcignore to exclude __pycache__, so that regenerated bytecode files do not count as changes to the directory.

I don’t know of an easy way for DVC to analyze the actual dependency graph and to build and maintain the dependency list for you. It could probably be done with some custom scripts, though.
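
As a rough idea of what such a custom script could look like, here is a minimal sketch that walks the imports of a stage script and lists the project-local files it pulls in. This is not a DVC feature: local_imports and dvc_dep_flags are hypothetical helper names, the script is assumed to live inside the project root, and only plain absolute imports are followed (no relative imports, no dynamic imports, no notebooks).

```python
import ast
from pathlib import Path


def local_imports(script, root):
    """Return project-local .py files that `script` imports, transitively.

    Hypothetical helper, not part of DVC. Paths are returned relative to
    `root`, and `script` is assumed to be located somewhere under `root`.
    """
    root = Path(root).resolve()
    start = Path(script).resolve()
    seen = set()
    stack = [start]
    while stack:
        path = stack.pop()
        if path in seen:
            continue
        seen.add(path)
        tree = ast.parse(path.read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                names = [node.module]
            else:
                continue
            for name in names:
                rel = Path(*name.split("."))
                # A module is either <name>.py or a package <name>/__init__.py,
                # looked up next to the importing file or at the project root.
                for base in (path.parent, root):
                    for candidate in (base / rel.with_suffix(".py"),
                                      base / rel / "__init__.py"):
                        if candidate.is_file():
                            stack.append(candidate.resolve())
    seen.discard(start)
    return sorted(p.relative_to(root).as_posix() for p in seen)


def dvc_dep_flags(script, root):
    """Format the discovered files as `-d` arguments for a dvc command line."""
    return " ".join(f"-d {dep}" for dep in local_imports(script, root))
```

Calling dvc_dep_flags("code/xml_to_tsv.py", ".") would then give a string of -d flags to paste into the stage command. Standard-library and installed packages are skipped simply because they do not resolve to files inside the project, so version changes in those would still need to be tracked some other way.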


Putting the notebooks in each stage’s subdirectory should work for us. Thanks for the hint!
