Looking for Workflow Suggestion

I’m looking to integrate DVC into my existing workflow and would appreciate feedback on the best way to use it for my case.

I have an existing workflow that roughly works like this (sketched below):

  1. pulls a repo from git,
  2. builds a docker image and runs a container,
  3. calls python train.py or mlflow run
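Roughly, the commands look something like this (the image name, mount path, and script arguments are just placeholders for illustration):

git pull
docker build -t train-image .
docker run --rm -v /data:/data train-image \
    python train.py --train train.txt --val val.txt --test test.txt
# or, alternatively: mlflow run .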

The data in my case usually resides on local disk. I create three text files, one each for train, val, and test, which are passed as inputs to the data loader.

I would like to improve on this by introducing DVC into this pipeline. Here are my questions:

  1. Since my data resides locally, does it make sense for me to dvc push to a remote and then download the same data again while training? That makes 3 copies of the same data. Unfortunately, I don’t have the luxury of cloud storage, so even the “remote” push will reside on the same disk or some disk on the network (a sketch of what I mean by such a local “remote” follows the list). In addition, I’d like to know whether this approach always creates at least 3 copies, or whether there are scenarios with fewer. (I do understand the remote copy works as a backup and for sharing data across teams.)

  2. Each time I tried dvc pull (with or without a remote), it downloaded the data into the git working directory. I tried changing the output path to a location outside the working directory, but it fails on me. I’d like to do this so that I don’t have to “download” the data every time I launch an experiment (flow described above). Ideally, an experiment run should pull data only when there is new or changed data.
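To make question 1 concrete, this is roughly what I mean by a local “remote” (the storage path is just an example):

dvc remote add -d localstore /mnt/nas/dvc-storage   # "remote" is just a directory on a network disk
dvc push    # workspace + local cache + this remote directory = the 3 copies I’m worried about
dvc pull    # later, or on another machine: remote -> cache -> workspace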

For now, what I have in mind is 2 separate workflows using 2 different repos: one for handling data with DVC and another for ML experiments (described above). The first repo would handle processing of the data, and it is checked in. After checking in, I do a git pull and dvc pull to download the data. Once it’s downloaded, I can use my existing workflow described above.
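In that setup, launching an experiment would start with something like this (the repo name and paths are made up):

# refresh the data repo before starting an experiment
cd /mnt/data/data-repo
git pull
dvc pull     # skips anything already present in the local cache
# then run the existing docker/train.py flow from the experiments repo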

This is what I have in mind. It feels like I’m missing something, so I wanted to reach out to the community to see whether I should be approaching this differently.

Thanks

Hi, where is that data stored? Is it always going to be in a single location?
Can you set the cache directory to that location and attach the volume to the container?

dvc cache dir path/to/cache/dir
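For example (the paths and image name are placeholders), the cache can live on the disk that already holds the data, and that disk can be mounted into the container:

dvc cache dir /mnt/data/dvc-cache      # keep the DVC cache on the data disk
git add .dvc/config                    # the cache location is stored in the repo config
docker run --rm -v /mnt/data:/mnt/data train-image python train.py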
  1. If you don’t need a remote, you don’t need to set one.
  2. You can run dvc exp run, which won’t pull since you have everything in the cache (and you don’t have any remote). It will automatically create the data required for your stages in your workspace from the cache.
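For instance, with the cache already populated, a run could be as simple as this (assuming your stages are defined in dvc.yaml):

dvc exp run     # checks out the needed data from the local cache; no remote transfer
dvc exp show    # compare parameters and metrics across experiments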

Also note that you can share the same cache among many DVC repositories.
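A minimal sketch, assuming the shared cache lives at /mnt/data/dvc-cache (the path and the optional settings are just one way to do it):

# run inside each DVC repository that should share the cache
dvc cache dir /mnt/data/dvc-cache
dvc config cache.shared group        # optional: let several users/repos write to the same cache
dvc config cache.type symlink,copy   # optional: link from the cache instead of copying into the workspace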

Thanks for your response.

Yes, for every project the data will be stored in one location.

To do this, do I still need to create a repo?

  1. Understood. But am I right in saying there will be at least 3 copies?
  2. I plan to use mlflow or plain Python to launch training. I want to avoid copying data into the workspace each time I run training. Is this inevitable with DVC? My datasets are 50-60 GB, and it takes a long time to copy them over the network to my local storage.
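Concretely, the step I’d like to avoid repeating on every experiment launch is roughly this (sizes as above):

dvc pull           # re-materializes the ~50-60 GB dataset into the workspace before each run
python train.py    # or mlflow run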