Hi all,
I’m figuring out how to structure my project and found your tutorials a very helpful introduction.
In them, however, you only rely on a single script for each stage, for example:
dvc run -d code/xml_to_tsv.py -d data/Posts.xml -o data/Posts.tsv \
-f prepare.dvc \
python code/xml_to_tsv.py data/Posts.xml data/Posts.tsv
Now my understanding is the following:
If xml_to_tsv.py imports some of my other libraries (e.g., my_special_xml_importer.py) and I make changes to my_special_xml_importer.py, these changes would not be picked up by DVC, since my_special_xml_importer.py is not an explicit dependency of the stage, correct?
What’s the best practice here for bigger projects, where each DVC stage is not just contained in a single script?
Our use case will be the following: for each stage we’ll have a Jupyter notebook, which will import some of our Python packages. I’m assuming I should create a stage like this:
dvc run -d my_notebook.ipynb -d code/my_lib.py -d data/Posts.xml -o data/Posts.tsv \
-f prepare.dvc \
papermill my_notebook.ipynb my_notebook_out.ipynb
Is this a good approach? Are there other ways? How are people with bigger projects dealing with this issue?
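To make the question concrete, here is a rough sketch of how the -d list might be generated instead of typed by hand. It is not a DVC feature. The helper names local_import_paths and dependency_flags are hypothetical, and the sketch assumes that the imported modules live next to the script (or under a given search root). It only sees direct, absolute imports in a .py file, so transitive imports and notebooks would need extra handling.

```python
import ast
from pathlib import Path


def local_import_paths(script, search_root=None):
    """Return paths of project-local modules imported directly by `script`.

    Hypothetical helper: a module counts as "local" if a matching .py file
    or package directory exists under `search_root` (default: script's folder).
    """
    script = Path(script)
    root = Path(search_root) if search_root else script.parent
    tree = ast.parse(script.read_text(encoding="utf-8"))

    # Collect the dotted names of all absolute imports in the script.
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module)

    # Keep only the names that resolve to files inside the project.
    found = []
    for name in sorted(names):
        candidate = root.joinpath(*name.split("."))
        module_file = candidate.with_suffix(".py")
        if module_file.is_file():
            found.append(module_file)
        elif (candidate / "__init__.py").is_file():
            found.append(candidate)  # a package directory
    return found


def dependency_flags(script, search_root=None):
    """Build the `-d` arguments for the script and its local imports."""
    flags = ["-d", str(script)]
    for path in local_import_paths(script, search_root):
        flags += ["-d", str(path)]
    return flags
```

For example, dependency_flags("code/xml_to_tsv.py") would return a list of arguments that could be placed in front of the remaining dvc run options, so that an edit to my_special_xml_importer.py also invalidates the stage.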
Thanks in advance,
Fabi