Hi, I am working with several quite large datasets, DVC pipelines, and a few big outputs that they produce.
Let’s say that I have the following pipeline (pseudo-graph):
   DVC imports               DVC pipeline              Pipeline outs
┌────────────────┐        ┌────────────────┐         ┌────────────────┐
│                │        │                │         │                │
│                │        │    Stage AB    │         │     Output     │
│     Import     │        │                │         │                │
│       A        ├───────►│                ├────────►│       AB       │
│                │        │                │         │                │
│                │        │                │         │                │
└────────────────┘        └────────────────┘         └────────────────┘
                                  ▲
┌────────────────┐                │
│                │                │
│                │                │
│     Import     │                │
│       B        ├────────────────┘
│                │
│                │
└────────────────┘
┌────────────────┐        ┌────────────────┐         ┌────────────────┐
│                │        │                │         │                │
│                │        │                │         │     Output     │
│     Import     ├───────►│    Stage C     ├────────►│                │
│       C        │        │                │         │       C        │
│                │        │                │         │                │
│                │        │                │         │                │
└────────────────┘        └────────────────┘         └────────────────┘
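To make the graph above more concrete, it could correspond to a dvc.yaml roughly like the one below. This is only a sketch: the stage names, the scripts in cmd, and the output paths are made up for illustration, and A, B, C stand for whatever paths are tracked by A.dvc, B.dvc and C.dvc.

```yaml
# Hypothetical dvc.yaml matching the pseudo-graph.
# A, B and C are the imported data paths (tracked by A.dvc, B.dvc, C.dvc).
stages:
  stage_ab:
    cmd: python make_ab.py A B AB        # placeholder command
    deps:
      - A
      - B
    outs:
      - AB                               # "Output AB" in the graph
  stage_c:
    cmd: python make_c.py C C_out        # placeholder command
    deps:
      - C
    outs:
      - C_out                            # "Output C" in the graph
```

The point is only that stage_ab depends on A and B, while stage_c depends on C alone, so the two branches share no data.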
I download the cache for the imports A, B and C (dvc pull A.dvc B.dvc C.dvc) and start working on the part of the DVC pipeline that reproduces the C outputs. After finishing my work, I run dvc repro, commit everything, and push the DVC cache to the remote. Now I want to focus on the part of the pipeline concerned with AB. I no longer need the import C or the outputs of stage C, but they still sit in my local cache and take up disk space. Can I somehow remove C and its output from the (local) cache but keep A and B? Thanks for any help!