What is the best way to deal with outside data, such as a well-known dataset like ImageNet or a Kaggle dataset?
More specifically, I want to download the data from the source as the first stage in the pipeline and then use it as a dependency for the next stage. But I don’t want to upload those (potentially) huge datasets to my S3 bucket.
Hi @noenzyme !
There are a few ways you can go about it:
- If you don’t need the raw dataset itself but only a much smaller transformed version of it, you can combine your download and transform pipeline stages into a single stage, so that DVC doesn’t even know about the raw dataset. E.g.
dvc run -d download_and_transform.py -o transformed.data python download_and_transform.py
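For illustration, here is a rough sketch of what such a combined download_and_transform.py could look like. The URL and the transform logic are placeholders, not something from the original thread; the point is just that the raw data lives only in memory (or a temp file) and the single declared output is the reduced version:

```python
# download_and_transform.py -- hypothetical sketch of option 1:
# fetch the raw dataset and write only the transformed result,
# so the raw data never becomes a DVC-tracked output.
import urllib.request

RAW_URL = "https://example.com/raw.csv"  # placeholder source URL


def transform(lines):
    # Placeholder reduction: drop blank lines and comment lines.
    return [ln for ln in lines if ln.strip() and not ln.startswith("#")]


def main():
    with urllib.request.urlopen(RAW_URL) as resp:
        raw = resp.read().decode("utf-8").splitlines()
    with open("transformed.data", "w") as f:
        f.write("\n".join(transform(raw)))


if __name__ == "__main__":
    main()
```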
- If you do indeed need the raw data (or if your transform stage is not deterministic enough to simply glue together with the download stage), but you just don’t want to push its cache to your remote, you can tell DVC not to cache it in the download stage of your pipeline. E.g.
# Notice the `-O|--outs-no-cache` flag, which tells DVC not to cache `raw.data`. Also note that you will need to add it to .gitignore yourself.
dvc run -d download.py -O raw.data python download.py
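For reference, here is how the whole two-stage pipeline from option 2 could look in the dvc.yaml format that newer DVC versions use (stage names and the transform.py filename are assumptions for the sketch); the `cache: false` entry on `raw.data` is the equivalent of the `-O` flag, while `raw.data` still works as a regular dependency of the next stage:

```yaml
stages:
  download:
    cmd: python download.py
    deps:
      - download.py
    outs:
      - raw.data:
          cache: false   # like -O: track raw.data but keep it out of the cache
  transform:
    cmd: python transform.py
    deps:
      - transform.py
      - raw.data         # the uncached raw data is still a dependency here
    outs:
      - transformed.data # cached normally, so it can be pushed to the remote
```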
Would something like that suit your scenario?
I hadn’t thought of using the code download.py as a dependency. Option 2 will work nicely, I’m thinking.