Hi @noenzyme !
There are a few ways you can go about it:
- If you don’t need the raw dataset itself, only a much smaller transformed version of it, you can combine your download and
transform stages into a single stage, so that dvc never even knows about the raw dataset. E.g.
dvc run -d download_and_transform.py -o transformed.data python download_and_transform.py
- If you do need the raw data (or if your transform stage is not deterministic enough to be glued together with the
download stage), but just don’t want to push its cache to your remote, you can simply tell dvc not to cache it in the
download stage of your pipeline. E.g.
# Notice the `-O|--outs-no-cache` flag, which tells dvc not to cache `raw.data`. Also note that you will need to add it to .gitignore yourself.
dvc run -d download.py -O raw.data python download.py
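For the .gitignore part, a minimal sketch of what I mean, assuming `raw.data` lands next to the `.gitignore` you are editing (adjust the path if your layout differs):

```shell
# Keep the uncached output out of Git, since it has to be ignored by hand here.
# The leading slash anchors the pattern to the directory holding this .gitignore.
touch .gitignore
grep -qxF '/raw.data' .gitignore || echo '/raw.data' >> .gitignore
```

The `grep` check just makes it safe to run more than once without duplicating the entry.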
Would something like that suit your scenario?