First off, thanks for a great tool! While we don’t use the pipelines very much, my team does use DVC to store our data: several sets of hundreds of GB each.
I often work with one data set at a time, usually for several weeks. It would be nice to be able to completely clear the data sets not currently in use from my local machine. We use a monorepo for everything, and if I understand gc correctly, the reason my cache doesn’t get cleared is that there are .dvc files for all the sets in the branch head.
Since there are some datasets I almost never use, it would be nice to be able to clear them completely from the local cache, and then once I need one I’ll take my punishment and wait for it to download from the remote using dvc pull. My question, I guess, is whether there is a nice way of doing this that I’m missing. Currently I’ve resorted to manually deleting everything in the cache folder every few months to start fresh and pull what I need. It works, but it doesn’t feel like the correct way to go about it.
Deleting the cache dir and then pulling what you want is a pretty good workaround. But I suppose you want to be able to tell gc that you want to keep only some specific .dvc files? Kind of like how you can dvc pull specific data right now.
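For anyone who wants to script that workaround, here is a rough Python sketch. It assumes the cache sits at the default `.dvc/cache` location and that `dvc` is on your PATH; the function names and the `data/set_a.dvc` target are made up for illustration. Be aware that anything in the cache that has not been pushed to the remote is gone after this.

```python
import shutil
import subprocess
from pathlib import Path


def clear_local_cache(repo_root, cache_rel=".dvc/cache"):
    """Delete the local DVC cache directory and everything in it.

    Assumes the default cache location (.dvc/cache); pass a different
    cache_rel if your DVC config points the cache somewhere else.
    """
    cache_dir = Path(repo_root) / cache_rel
    if cache_dir.is_dir():
        shutil.rmtree(cache_dir)
    return cache_dir


def pull_command(targets):
    """Build the `dvc pull` invocation for the given targets."""
    return ["dvc", "pull", *targets]


def reset_and_pull(repo_root, targets, run=subprocess.run):
    """Wipe the local cache, then pull only the listed targets.

    `run` is injectable so the flow can be dry-run without DVC installed.
    """
    clear_local_cache(repo_root)
    return run(pull_command(targets), cwd=repo_root, check=True)
```

Something like `reset_and_pull(".", ["data/set_a.dvc"])` from the repo root would then leave only that one data set in the local cache.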
Yeah, hehe, it seems to work fine. It just didn’t seem like the proper way to go about it, so I mainly wondered if I had missed something, but I’m happy to stick with this then.
An additional option to gc would be nice, yes! Telling gc to keep only the listed targets in the cache, regardless of whether there are .dvc files pointing to other data or not. Maybe it could also check that it only clears cache entries that have been pushed to the remote, but I don’t know if that is expensive to check.
Makes sense! We have a ticket, https://github.com/iterative/dvc/issues/2036, for cleaning up the local cache if the data is already present on the remote. And we have a more generic ticket about revisiting gc, https://github.com/iterative/dvc/issues/2325, where we also discuss ways to clean up particular data. Glad that the current workaround works fine for you. We’ll be adding an option for that in the future, so stay tuned. Thanks for the feedback!