Dear DVC community,
I am rather new to DVC. I am interested in using DVC for a use case that may be related to what is called a “Data Registry”, but is not entirely the same.
I have a repository that stores, in a structured way, folders containing the outputs of expensive computational runs, with many files and a high storage volume. I added each folder as a single object, i.e. there is one .dvc file per folder.
repo
|- folder_a.dvc
|- folder_b.dvc
|- folder_c.dvc
|- folder_d.dvc
A user of the repository would first clone it and then, thanks to your partial pull feature, pull from the remote only those directories that are needed for the next steps in the data analysis pipeline. This way, the huge repository, for which a full pull would not fit on standard storage, can still be used properly.
(result after “dvc pull folder_a.dvc folder_c.dvc”)
repo
|- folder_a.dvc
|- folder_a
|- folder_b.dvc
|- folder_c.dvc
|- folder_c
|- folder_d.dvc
Let’s assume now that the user has stopped using one of those folders (say “folder_a”) after having pushed its updates. Is there an obvious way to remove that partial pull (here “folder_a”) from the local working copy without affecting the remote? (The goal is to keep local storage requirements at a moderate level.)
(result after applying the operation I am looking for)
repo
|- folder_a.dvc
|- folder_b.dvc
|- folder_c.dvc
|- folder_c
|- folder_d.dvc
I am not asking to remove data from the remote, but rather for a proper way to remove the folder (not its .dvc file) and all of its remaining data in the local cache. I assume that simply deleting the folder (“rm -fR folder_a”) would not be enough…
The simplest solution would of course be to delete the local working copy and clone a fresh one, starting from scratch with a partial pull. However, that might become cumbersome over time.
So, is there a simple way to do this with some kind of “unpull” operation that I have overlooked?
If you need further details, please let me know.
Thanks a lot in advance!