What is the best way to deal with outside data, such as a well-known dataset like ImageNet or a Kaggle dataset?
More specifically, I want to download the data from the source as the first stage in the pipeline and then use it as a dependency for the next stage. But I don’t want to upload those (potentially) huge datasets to my S3 bucket.
Hi @noenzyme !
There are a few ways you can go about it:
- If you don’t need the raw dataset itself but only a much smaller transformed version of it, you can combine your download and transform pipeline stages into a single stage, so that DVC doesn’t even know about the raw dataset. E.g.
dvc run -d download_and_transform.py -o transformed.data python download_and_transform.py
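For illustration, here is a rough sketch of what such a combined download_and_transform.py could look like. The URL and the transform logic are placeholders, not something from the original thread; the point is just that the raw data lives only in memory (or a temp file) and the single declared output is the reduced version:

```python
# download_and_transform.py -- hypothetical sketch of option 1:
# fetch the raw dataset and write only the transformed result,
# so the raw data never becomes a DVC-tracked output.
import urllib.request

RAW_URL = "https://example.com/raw.csv"  # placeholder source URL


def transform(lines):
    # Placeholder reduction: drop blank lines and comment lines.
    return [ln for ln in lines if ln.strip() and not ln.startswith("#")]


def main():
    with urllib.request.urlopen(RAW_URL) as resp:
        raw = resp.read().decode("utf-8").splitlines()
    with open("transformed.data", "w") as f:
        f.write("\n".join(transform(raw)))


if __name__ == "__main__":
    main()
```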
- If you do indeed need the raw data (or if your transform stage is not deterministic enough to simply glue together with the download stage), but you just don’t want to push its cache to your remote, you can tell DVC not to cache it in the download stage of your pipeline. E.g.
# Notice the `-O|--outs-no-cache` flag, which tells DVC not to cache `raw.data`. Also note that you will need to add it to .gitignore yourself.
dvc run -d download.py -O raw.data python download.py
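For reference, here is how the whole two-stage pipeline from option 2 could look in the dvc.yaml format that newer DVC versions use (stage names and the transform.py filename are assumptions for the sketch); the `cache: false` entry on `raw.data` is the equivalent of the `-O` flag, while `raw.data` still works as a regular dependency of the next stage:

```yaml
stages:
  download:
    cmd: python download.py
    deps:
      - download.py
    outs:
      - raw.data:
          cache: false   # like -O: track raw.data but keep it out of the cache
  transform:
    cmd: python transform.py
    deps:
      - transform.py
      - raw.data         # the uncached raw data is still a dependency here
    outs:
      - transformed.data # cached normally, so it can be pushed to the remote
```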
Would something like that suit your scenario?
I hadn’t thought of using the code download.py as a dependency. Option 2 will work nicely, I’m thinking.