Best practice for python package dependency?

rabefabi · April 23, 2020, 11:26am

Hi all,

I’m figuring out how to structure my project and found your tutorials a very helpful introduction.

In them, however, you only rely on a single script for each stage, for example:

dvc run -d code/xml_to_tsv.py -d data/Posts.xml -o data/Posts.tsv \
          -f prepare.dvc \
          python code/xml_to_tsv.py data/Posts.xml data/Posts.tsv

Source

Now my understanding is the following:
If xml_to_tsv.py imports another of my own modules (e.g., my_special_xml_importer.py) and I change my_special_xml_importer.py, DVC would not pick up that change, since my_special_xml_importer.py is not an explicit dependency of the stage. Is that correct?

What’s the best practice here for bigger projects, where each DVC stage is not just contained in a single script?

Our use case is the following: for each stage we'll have a Jupyter notebook that imports some of our Python packages. I'm assuming I should create a stage like this:

dvc run -d my_notebook.ipynb -d code/my_lib.py -d data/Posts.xml -o data/Posts.tsv \
          -f prepare.dvc \
          papermill my_notebook.ipynb my_notebook_out.ipynb

Is this a good approach? Are there other ways? How are people with bigger projects dealing with this issue?

Thanks in advance,
Fabi

shcheklein · April 24, 2020, 12:33am

Hi @rabefabi!

A reasonable way is to put the code that is specific to a stage into its own directory and declare that directory as the dependency:

dvc run -d train ... train/train.py ...

In this case, also use .dvcignore to exclude __pycache__; otherwise the bytecode files Python writes there would count as changes to the directory.

I don't know of an easy way for DVC to analyze the actual graph of dependencies, then build and maintain it. It could probably be done with some custom scripts, though.
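As a rough idea of what such a custom script could look like, here is a minimal sketch that uses only the Python standard library. It is not a DVC feature: the function names local_imports and dvc_dep_flags are made up for illustration. It assumes plain absolute imports of .py files that sit next to the importing file or at the project root, so it ignores packages with __init__.py, relative imports, dynamic imports, and notebooks (an .ipynb file would first have to be converted to a script).

```python
import ast
from pathlib import Path


def local_imports(script: Path, root: Path) -> set[Path]:
    """Return the project-local .py files that `script` imports, transitively."""
    root = root.resolve()
    found: set[Path] = set()
    todo = [script.resolve()]
    while todo:
        current = todo.pop()
        tree = ast.parse(current.read_text())
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                names = [node.module]
            else:
                continue
            for name in names:
                # Assumption: a local module lives next to the importing
                # file or directly under the project root.
                for base in (current.parent, root):
                    candidate = base.joinpath(*name.split(".")).with_suffix(".py")
                    if candidate.is_file() and candidate not in found:
                        found.add(candidate)
                        todo.append(candidate)
    return found


def dvc_dep_flags(script: Path, root: Path) -> str:
    """Build a string of `-d <path>` flags for the script and its local imports."""
    root = root.resolve()
    deps = local_imports(script, root) | {script.resolve()}
    relative = sorted(p.relative_to(root).as_posix() for p in deps)
    return " ".join(f"-d {p}" for p in relative)
```

Calling dvc_dep_flags(Path("code/xml_to_tsv.py"), Path(".")) from the project root would return a string of -d flags that can be pasted into the dvc run command. Standard-library and third-party imports are skipped simply because no matching file exists inside the project.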

rabefabi · April 27, 2020, 6:52am

Putting the notebooks in each stage's subdirectory should work for us. Thanks for the hint!
