I’m figuring out how to structure my project and found your tutorials a very helpful introduction.
In them, however, you only rely on a single script for each stage, for example:
dvc run -d code/xml_to_tsv.py -d data/Posts.xml -o data/Posts.tsv \
        -f prepare.dvc \
        python code/xml_to_tsv.py data/Posts.xml data/Posts.tsv
Now my understanding is the following:
If xml_to_tsv.py imports some of my other modules (e.g., my_special_xml_importer.py) and I change my_special_xml_importer.py, those changes would not be picked up by DVC, since my_special_xml_importer.py is not an explicit dependency of the stage. Is that correct?
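To make the concern concrete, here is a rough sketch of the bookkeeping I would otherwise have to do by hand: scan a script for the local modules it imports and turn each one into an extra -d argument. This is my own illustration, not something DVC provides. The helper names local_imports and dep_flags are made up, and it assumes all my modules are single .py files sitting directly in one directory (code/ in the example). It only follows direct imports, not imports of imports.

```python
import ast
from pathlib import Path


def local_imports(script, code_dir):
    """Return paths of single-file modules in code_dir that script imports directly.

    Hypothetical helper: standard-library and third-party imports are ignored
    because no matching <name>.py exists in code_dir.
    """
    tree = ast.parse(Path(script).read_text())
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # "import a.b" -> top-level name "a"
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # "from a.b import c" -> top-level name "a"; relative imports are skipped
            names.add(node.module.split(".")[0])
    code_dir = Path(code_dir)
    return sorted(
        str(code_dir / f"{name}.py")
        for name in names
        if (code_dir / f"{name}.py").is_file()
    )


def dep_flags(script, code_dir):
    """Build the list of -d arguments: the script itself plus its local imports."""
    flags = []
    for dep in [str(script)] + local_imports(script, code_dir):
        flags += ["-d", dep]
    return flags
```

For the example above, dep_flags("code/xml_to_tsv.py", "code") would yield the -d arguments for xml_to_tsv.py and my_special_xml_importer.py, which I would then have to splice into the dvc run call for every stage.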
What’s the best practice here for bigger projects, where each DVC stage is not just contained in a single script?
Our use case is the following: for each stage we’ll have a Jupyter notebook that imports some of our Python packages. I’m assuming I should create a stage like this:
dvc run -d my_notebook.ipynb -d code/my_lib.py -d data/Posts.xml -o data/Posts.tsv -f prepare.dvc papermill my_notebook.ipynb my_notebook_out.ipynb
Is this a good approach? Are there other ways? How are people with bigger projects dealing with this issue?
Thanks in advance,