First of all, thanks for the great tool; it really helps manage ML practice!
Right now I have set up a repo in a shared folder on my local PC and have been trying DVC for a while. My main use for it now is data file versioning.
My problem is how to handle the growth of files in the repository. It is a showstopper for me when it comes to promoting DVC to my team or to larger-scale usage …
Is there any mechanism to keep only the latest 5 versions of file “X”, or any statistic that could tell how many files we have and how many versions we have saved in the repo? I checked the tracked data files in the repo: they are compressed, with hashed filenames. It is hard to convert them back to filenames for analytics.
Is there any tool or better way to handle that? Thanks in advance.
I’m not sure whether your question refers to the files in your local cache or in remote storage (a backup of the cache), but no problem. The short answer is that, AFAIK, the only tool built into DVC to manage the cache/storage is the dvc gc command: https://dvc.org/doc/command-reference/gc
It usually garbage collects the local project cache, but applies to the DVC remote instead when using the --cloud option. The other 2 options I think could help you are the --all-branches and --all-tags flags, which tell DVC to keep all files associated with any Git branch or tag, respectively.
Is there any mechanism to keep only the latest 5 versions of file “X”?
I don’t think so. Would you mind sending a feature request so that dvc gc supports a new option to specify a number of recent versions to keep for all or certain data files? You can do so at https://github.com/iterative/dvc/issues/new/choose
Is there any statistic that could tell how many files we have and how many versions we have saved in the repo?
Same here: DVC doesn’t have a tool to report this right now, although it could be worked out by exploring the commit history of each DVC-file, which shouldn’t be too difficult! But feel free to open another feature request for that, or to continue this conversation by asking in dvc.org/chat; we’re pretty responsive over there.
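As a rough illustration of the "explore the commit history of each DVC-file" idea, here is a minimal Python sketch. It is not a DVC feature, and it rests on assumptions: that the DVC-file is tracked by Git, that each output's hash sits on an indented or "- "-prefixed md5: line of the DVC-file, and that the repo path passed in is the root of the Git repository. The function names and the X.dvc file name are hypothetical.

```python
import re
import subprocess

# Assumption: an output hash is on a line like "- md5: <hash>" or "  md5: <hash>".
# A top-level "md5:" line (at column 0) is deliberately not matched.
MD5_LINE = re.compile(
    r"^(?=[ \t-])[ \t]*(?:-[ \t]+)?md5:[ \t]*([0-9a-f]{32}(?:\.dir)?)[ \t]*$",
    re.MULTILINE,
)


def output_hashes(dvc_file_text):
    """Return the output hashes recorded in one revision of a DVC-file."""
    return MD5_LINE.findall(dvc_file_text)


def distinct_versions(revisions):
    """Return the distinct output hashes across revisions, in first-seen order."""
    seen = []
    for text in revisions:
        for file_hash in output_hashes(text):
            if file_hash not in seen:
                seen.append(file_hash)
    return seen


def dvc_file_history(dvc_file, repo="."):
    """Yield the text of dvc_file at each commit that touched it, newest first."""
    commits = subprocess.run(
        ["git", "-C", repo, "log", "--format=%H", "--", dvc_file],
        check=True, capture_output=True, text=True,
    ).stdout.split()
    for commit in commits:
        shown = subprocess.run(
            ["git", "-C", repo, "show", f"{commit}:{dvc_file}"],
            capture_output=True, text=True,
        )
        # Skip commits where the file cannot be read (e.g. it was deleted there).
        if shown.returncode == 0:
            yield shown.stdout
```

Under those assumptions, len(distinct_versions(dvc_file_history("X.dvc"))) would give the number of versions of the data file recorded in the current branch's history, and the first 5 entries of the returned list would be the most recent ones.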
Thanks, Jorgeorpinal, for the reply and advice.
I’m referring to remote storage, e.g. Hadoop, Google Cloud, etc. Yes, gc seems to help only with the local cache. Without those tools, I’m a bit hesitant to provide a storage estimate for the remote server for costing …
I will send the request at the suggested location. Many thanks.
Sounds good. But please note that dvc gc works on remote storage using the --cloud option, like I mentioned. It just doesn’t have the specific capabilities you need directly.
Please let us know if you have further questions here or in the chat. Looking forward to seeing your feature request on GH!