I’m looking to integrate DVC into my existing workflow and would like feedback on the best way to use it for my case.
I have an existing workflow that roughly works like this:
- pulls a repo from git,
- builds a Docker image and runs a container,
- calls `python train.py` or `mlflow run`.
The data in my case usually resides on local disk, so I create 3 text files (train, val, test), each of which is passed as input to the data loader.
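For context, a single run currently looks roughly like this (repo, image, and mount paths below are placeholders, not my actual setup):

```
# Rough shape of the current per-experiment flow (names/paths are placeholders)
git clone git@github.com:me/ml-experiments.git
cd ml-experiments
docker build -t ml-train:latest .
docker run --rm -v /data:/data ml-train:latest \
    python train.py                 # or: mlflow run .
```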
I would like to improve on this by introducing DVC into this pipeline. Here are my questions:
- Since my data resides locally, does it make sense for me to `dvc push` to a remote and then download the same data again while training? That makes 3 copies of the same data. Unfortunately, I don’t have the luxury of cloud storage, so even the “remote” push will land on the same disk or on some disk on the network. In addition, I’d like to know whether this approach always creates at least 3 copies, or whether there are scenarios where one can get away with fewer. (I do understand the remote copy works as a backup and for sharing data across teams.) My rough idea for this is the first sketch after this list.
- Each time I try `dvc pull` (with or without a remote), it downloads the data into the git working directory. I tried modifying the `outs` → `path` to a path outside the working directory, but it fails on me. I’d like to do this so that I don’t have to “download” the data every time I launch an experiment (the flow described above); the experiment run should pull data only if there is new or changed data. The second sketch after this list shows roughly what I’m hoping for.
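For the first question, this is roughly the setup I have in mind: the “remote” is just another directory on the same disk or on the NAS, and (if I understand the docs correctly) `cache.type` can be set to link-based types so the workspace copy is a link into the cache rather than a full extra copy. Paths below are placeholders:

```
# The "remote" is just another directory on the same disk / NAS (placeholder path)
dvc remote add -d localremote /mnt/nas/dvc-storage

# If I understand correctly, link-based cache types let workspace files be
# hardlinks/symlinks into the DVC cache instead of full copies
dvc config cache.type "hardlink,symlink"

dvc add data/raw    # start tracking the dataset (placeholder path)
dvc push            # back the cache up to the local "remote"
```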
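For the second question, what I was hoping for is roughly this: keep the DVC cache at a fixed location outside the git working directory, so a fresh checkout only links files that are already cached, and `dvc pull` only fetches data that has actually changed. Again, paths are placeholders and I’m not sure this is the intended usage:

```
# Keep the cache outside the repo so experiment runs can reuse it (placeholder path)
dvc cache dir /data/dvc-cache
dvc config cache.type symlink   # workspace files become links into that cache
dvc config cache.shared group   # let other users/containers reuse the cache

# On each experiment launch: this should only fetch new or changed data
dvc pull
```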
For now, what I have in mind is 2 separate workflows using 2 different repos: one for handling data with DVC, and another for the ML experiments (described above). The first repo would handle processing the data, which is then checked in. After checking in, I run `git pull` and `dvc pull` to have the data downloaded. Once it’s downloaded, I can use my existing workflow described above. A rough sketch of this plan follows.
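Concretely, the plan would look something like this (repo names and the processing script are placeholders):

```
# Repo 1: data handling with DVC (run whenever the data changes)
cd data-repo
python process_data.py              # hypothetical processing step
dvc add data/processed
git add data/processed.dvc .gitignore
git commit -m "Update processed dataset"
dvc push                            # to the local / NAS remote
git push

# On the training machine: refresh the data, then run the existing flow
cd data-repo
git pull
dvc pull                            # should only fetch new/changed files
# ...then build the Docker image and run `python train.py` / `mlflow run` as before
```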
This is what I have in mind. It feels like I’m missing something, so I wanted to reach out to the community and see whether I should be approaching this differently.
Thanks