Removal of partial pulls

Dear DVC community,

I am rather new to DVC. I am interested in using DVC for a use case that may be related to what is called a “Data Registry”, but is not entirely the same.

I have a repository that stores, in a structured way, folders with outputs from expensive computational runs. Each folder contains many files and takes up a lot of storage. I added each folder as a single object, i.e. there is one .dvc file per folder.

Example:

repo
|- folder_a.dvc
|- folder_b.dvc
|- folder_c.dvc
|- folder_d.dvc

A user of the repository would first clone it and, thanks to your partial pull feature, pull only those directories from the remote that are needed for the next steps in the data analysis pipeline. That way the huge repository, for which a full pull would not fit on standard storage, can still be used properly.

Example (cont.):
(result after “dvc pull folder_a.dvc folder_c.dvc”)

repo
|- folder_a.dvc
|- folder_a
|- folder_b.dvc
|- folder_c.dvc
|- folder_c
|- folder_d.dvc

Let’s assume now that the user has stopped using one of those folders (say “folder_a”) after having pushed its updates. Is there an obvious way to remove that partial pull (here “folder_a”) from the local working copy without affecting the remote? (The goal is to keep local storage requirements at a moderate level.)

Example (cont.):
(result after applying the operation I am looking for)

repo
|- folder_a.dvc
|- folder_b.dvc
|- folder_c.dvc
|- folder_c
|- folder_d.dvc

I am not asking to remove data from the remote, but for a proper way to remove the folder (not its .dvc file) and all of its remaining data in the local cache. I assume that simply deleting the folder (“rm -fR folder_a”) would not be enough…

Of course, the simplest solution would be to delete the local working directory and clone a fresh one, starting from scratch with a partial pull. However, that might become cumbersome over time.

So is there a simple way to do this, some kind of “unpull” operation, that I overlooked?

If you need further details, please let me know.

Thanks a lot in advance!

Hello again,

After a lot more reading, the operation I was looking for in the example above seems to be

rm -fR folder_a
dvc gc -w

Would that be correct?

Thanks!

Hi @zaspel. Welcome to the community.

There’s no straightforward way to do this. I can suggest a workaround:

dvc remove folder_a.dvc --outs
dvc gc -w
git checkout HEAD -- folder_a.dvc

This temporarily deletes folder_a.dvc, removes the folder_a directory, and then runs gc to delete its contents from the cache.
After that, we check out folder_a.dvc again to restore it. :slight_smile:
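
If you need this often, the three steps could be wrapped in a small shell function. This is only a rough sketch: the name dvc_unpull is made up for illustration, and it assumes that the .dvc file is committed in Git and that the folder's data has already been pushed, since "dvc gc -w" deletes cached data that the current workspace no longer references.

```shell
# Hypothetical helper (not part of DVC): drop one DVC-tracked folder from the
# workspace and the local cache, but keep its .dvc file.
# Assumes the folder's data was already pushed to the remote.
dvc_unpull() {
    # Accept either "folder_a" or "folder_a.dvc" as the argument.
    unpull_target="${1%.dvc}"

    # Remove the .dvc file together with the tracked folder.
    dvc remove "${unpull_target}.dvc" --outs || return 1

    # Delete cache entries no longer referenced by the workspace
    # (dvc asks for confirmation here).
    dvc gc -w || return 1

    # Restore the .dvc file from the last commit so the folder
    # can be pulled again later.
    git checkout HEAD -- "${unpull_target}.dvc"
}
```

Usage would be "dvc_unpull folder_a". It is probably worth running "git status" afterwards, because "dvc remove" can also edit .gitignore.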

This is a simple use case that we should support; feel free to create a feature request. Thanks.

Dear @skshetry,

thank you so much for this!! This will be of great help in our project!

I will immediately proceed to propose this as a feature.

Thanks again!