What is the best way to deal with outside data, such as a well-known dataset like ImageNet or a Kaggle dataset?
More specifically, I want to download the data from the source as the first stage in the pipeline and then use it as a dependency for the next stage. But I don’t want to upload those (potentially) huge datasets to my S3 bucket.
Hi @noenzyme !
There are a few ways you can go about it:
- If you don’t need the raw dataset itself but only a much smaller transformed version of it, you can combine your download and transform pipeline stages into a single stage, so that DVC doesn’t even know about the raw dataset. E.g.
dvc run -d download_and_transform.py -o transformed.data python download_and_transform.py
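For illustration, here is a rough sketch of what such a combined download_and_transform.py could look like. The URL and the transform logic are placeholders, not something from the original thread; the point is just that the raw data lives only in memory (or a temp file) and the single declared output is the reduced version:

```python
# download_and_transform.py -- hypothetical sketch of option 1:
# fetch the raw dataset and write only the transformed result,
# so the raw data never becomes a DVC-tracked output.
import urllib.request

RAW_URL = "https://example.com/raw.csv"  # placeholder source URL


def transform(lines):
    # Placeholder reduction: drop blank lines and comment lines.
    return [ln for ln in lines if ln.strip() and not ln.startswith("#")]


def main():
    with urllib.request.urlopen(RAW_URL) as resp:
        raw = resp.read().decode("utf-8").splitlines()
    with open("transformed.data", "w") as f:
        f.write("\n".join(transform(raw)))


if __name__ == "__main__":
    main()
```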
- If you do indeed need the raw data (or if your transform stage is not deterministic enough to simply glue together with the download stage), but you just don’t want to push its cache to your remote, you can tell DVC not to cache it in the download stage of your pipeline. E.g.
# Notice the `-O|--outs-no-cache` flag, which tells DVC not to cache `raw.data`. Also note that you will need to add it to .gitignore yourself.
dvc run -d download.py -O raw.data python download.py
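For reference, here is how the whole two-stage pipeline from option 2 could look in the dvc.yaml format that newer DVC versions use (stage names and the transform.py filename are assumptions for the sketch); the `cache: false` entry on `raw.data` is the equivalent of the `-O` flag, while `raw.data` still works as a regular dependency of the next stage:

```yaml
stages:
  download:
    cmd: python download.py
    deps:
      - download.py
    outs:
      - raw.data:
          cache: false   # like -O: track raw.data but keep it out of the cache
  transform:
    cmd: python transform.py
    deps:
      - transform.py
      - raw.data         # the uncached raw data is still a dependency here
    outs:
      - transformed.data # cached normally, so it can be pushed to the remote
```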
Would something like that suit your scenario?
I hadn’t thought of using the code download.py as a dependency. Option 2 will work nicely, I’m thinking.