Dear DVC developers,
the following feature request is a follow-up to the thread
The feature that I would love to see is kind of the opposite of a partial pull:
Let’s assume we have a repository with many folders, each containing huge data sets. The repository serves as an archive for the data in these folders, but it is far too large to be pulled in full. Each folder is added via a single .dvc file:
repo
|- folder_a.dvc
|- folder_b.dvc
|- folder_c.dvc
|- folder_d.dvc
A user of the repository would first clone it and, thanks to your partial pull feature, pull only those directories from the remote that are needed for the next steps in the data analysis pipeline. This way the huge repository, for which a full pull would not fit on standard storage, can still be used properly.
Example (cont.):
(result after “dvc pull folder_a.dvc folder_c.dvc”)
repo
|- folder_a.dvc
|- folder_a
|- folder_b.dvc
|- folder_c.dvc
|- folder_c
|- folder_d.dvc
Let’s assume now that the user has stopped using one of those folders (say “folder_a”) after having pushed its updates.
The feature that I would love to see is a way to remove that partial pull (here “folder_a”) from the local working copy without affecting the remote. Hence, once the user is done with the huge folder, she/he could get rid of it locally.
Example (cont.):
(result after applying the requested operation)
repo
|- folder_a.dvc
|- folder_b.dvc
|- folder_c.dvc
|- folder_c
|- folder_d.dvc
@skshetry kindly suggested the following great workaround:
dvc remove folder_a.dvc --outs
dvc gc -w
git checkout HEAD -- folder_a.dvc
This does precisely what I am looking for. However, it would be fantastic to have this as a built-in operation, such as a “partial push with local removal” or something like this. It would greatly facilitate the use of working copies on smaller storage systems while dealing with huge repositories.
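To illustrate what such an operation might look like from the user's side, here is a minimal sketch that wraps the three workaround commands in a shell function. The names dvc_drop_local, _run and the DRY_RUN variable are hypothetical and only for illustration; they are not part of DVC. The sketch assumes the data has already been pushed to the remote.

```shell
#!/bin/sh
# Hypothetical helper, not a DVC command: frees the local copy of one
# DVC-tracked folder while keeping its .dvc file under Git.

# Print the command when DRY_RUN=1, otherwise execute it.
_run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

dvc_drop_local() {
    target="${1:-}"   # e.g. folder_a.dvc
    if [ -z "$target" ]; then
        echo "usage: dvc_drop_local <file.dvc>" >&2
        return 1
    fi
    # Same three steps as the workaround, stopping at the first failure.
    _run dvc remove "$target" --outs &&
    _run dvc gc -w &&
    _run git checkout HEAD -- "$target"
}
```

With this in place, "DRY_RUN=1; dvc_drop_local folder_a.dvc" would only print the three commands, which is a cheap way to check them before running anything. Note that "dvc gc -w" cleans the cache of everything the current workspace no longer references, not only the data of folder_a.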
I hope this feature will find some friends out there!
Thanks a lot!