Clear local cache completely and rely on remote

davlid · December 18, 2020, 11:47am

Hi!

First off, thanks for a great tool! While we don’t use the pipelines very much, we do use DVC to store our data in my team: several sets of 100s of GBs each.

I often work with one data set at a time, usually for several weeks. It would be nice to be able to completely clear the data sets not currently in use from my local machine. We use a monorepo for everything, and if I understand gc correctly, the reason my cache doesn’t get cleared is that there are .dvc files for all sets in the branch head.

Since there are some datasets I almost never use, it would be nice to be able to clear them completely from the local cache, and then once I need one I’ll take my punishment and wait for it to download from the remote using dvc pull. My question, I guess, is whether there is a nice way of doing this that I’m missing? Currently I’ve resorted to manually deleting everything in the cache folder every few months to start fresh and pull what I need. It works but doesn’t feel like the correct way to go about it.

All the best!

kupruser · December 18, 2020, 3:08pm

Hi @davlid !

Deleting the cache dir and then pulling what you want is a pretty good workaround. But I suppose you want to be able to tell gc to keep only some specific dvcfiles? Kinda like how you can dvc pull data right now.
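
For anyone wanting to script the workaround, here is a minimal Python sketch. It assumes the cache lives at the default .dvc/cache under the repo root (yours may be configured elsewhere), and the helper names clear_local_cache and pull_targets are made up for illustration. The only DVC call used is the dvc pull command already mentioned above.

```python
import shutil
import subprocess
from pathlib import Path


def clear_local_cache(repo_root: str, cache_rel: str = ".dvc/cache") -> int:
    """Delete everything inside the local cache dir and return the bytes freed.

    cache_rel is an assumption: the default cache location relative to the repo.
    """
    cache_dir = Path(repo_root) / cache_rel
    if not cache_dir.is_dir():
        return 0
    freed = 0
    for entry in cache_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            # Sum file sizes before removing the whole subdirectory.
            freed += sum(f.stat().st_size for f in entry.rglob("*") if f.is_file())
            shutil.rmtree(entry)
        else:
            freed += entry.lstat().st_size
            entry.unlink()
    return freed


def pull_targets(repo_root: str, *targets: str) -> None:
    """Run `dvc pull <targets>` in the repo to re-download only what is needed."""
    subprocess.run(["dvc", "pull", *targets], cwd=repo_root, check=True)
```

Usage would be something like clear_local_cache(".") followed by pull_targets(".", "data/set_a.dvc"), where the target path is only a placeholder. Before clearing, make sure everything you care about has been pushed to the remote, since this deletes the cache unconditionally.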

davlid · December 18, 2020, 6:27pm

Hi!

Yeah hehe, it seems to work fine. It just didn’t seem like the proper way to go about it, so I mainly wondered if I had missed something, but I’m happy to stick with this then.

An additional option to gc would be nice, yes! Telling gc to keep only the listed targets in cache, regardless of whether there are .dvc files pointing to them or not. Maybe also check to only clear cache that has been pushed to the remote, but I don’t know if that is expensive to check.
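
To make the idea concrete, the selection rule boils down to set arithmetic over file hashes. This is only an illustration with hypothetical names, not an existing gc option, and it assumes the three hash collections have already been gathered somehow.

```python
def removable_hashes(local, keep, on_remote):
    """Return the hashes that could be dropped from the local cache.

    local:     hashes currently in the local cache
    keep:      hashes belonging to the targets the user wants to keep
    on_remote: hashes known to exist on the remote
    All three inputs are assumed to be precomputed; nothing here talks to DVC.
    """
    # Drop only what is not wanted locally AND is safely stored on the remote.
    return (set(local) - set(keep)) & set(on_remote)
```

Anything that is not in keep but also not on the remote stays put, so unpushed data is never lost. The expensive part would be building on_remote, which is the cost concern raised above.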

Many thanks!

kupruser · December 18, 2020, 8:10pm

Makes sense! We have a ticket, https://github.com/iterative/dvc/issues/2036, for cleaning up the local cache if it is already present on the remote. And we have a more generic ticket about revisiting gc, https://github.com/iterative/dvc/issues/2325, where we also discuss ways to clean up particular data. Glad that the current workaround works fine for you. We’ll be adding an option for that in the future, so stay tuned. Thanks for the feedback!
