How to manage files on remote repository

redonhk · April 28, 2020, 12:57am

Hi there,

First of all, thanks for the great tool that really helps manage ML practice!
Right now I have set up a repo in a shared folder on my local PC and have tried dvc for a while. My main usage for now is data file versioning.

My problem is how to handle the growth of files in the repository. It is one of the showstoppers for me in promoting dvc to the team or to larger-scale usage.

Is there any mechanism to support keeping track of only the latest 5 versions of file “X”, or any statistic that could tell how many files we have and how many versions we save in the repo? I checked the tracked data files in the repo: they are compressed and stored under hashed filenames, so it is hard to convert them back to filenames for analytics.

Is there any tool or better way to handle that? Thanks in advance.

Donald

jorgeorpinel · April 28, 2020, 1:46am

Hi Donald,

I’m not sure whether your question refers to the files in your local cache or in remote storage (a backup of the cache), but no problem. The short answer is that, AFAIK, the only tool built into DVC to manage the cache or storage is the dvc gc command.

It usually garbage collects the local project cache, but it also applies to the DVC remote when using the --cloud option. The other 2 options I think could help you are the --all-branches and --all-tags flags, which tell DVC to keep all files associated with any Git branch or tag, respectively.
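As a rough illustration of how these options combine, here is a minimal Python sketch that only assembles a dvc gc command line from the three flags mentioned above. The helper name build_gc_command is hypothetical, and nothing is executed, since dvc gc deletes data. Your DVC version may require additional flags or ask for confirmation, so please check dvc gc --help before running anything.

```python
def build_gc_command(cloud=False, all_branches=False, all_tags=False):
    """Return the argv list for a `dvc gc` call; nothing is executed here.

    Hypothetical helper for illustration. Only flags mentioned in this
    thread are used.
    """
    cmd = ["dvc", "gc"]
    if all_branches:
        # Keep files referenced from any Git branch.
        cmd.append("--all-branches")
    if all_tags:
        # Keep files referenced from any Git tag.
        cmd.append("--all-tags")
    if cloud:
        # Apply the collection to the DVC remote as well.
        cmd.append("--cloud")
    return cmd


if __name__ == "__main__":
    # Print the command for review instead of running it.
    print(" ".join(build_gc_command(cloud=True, all_branches=True, all_tags=True)))
```

Printing the command first makes it easy to review what would be kept before anything is removed from the remote.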

Is there any mechanism to support keeping track of only the latest 5 versions of file “X”?

I don’t think so. Would you mind sending a feature request so that dvc gc supports a new option to specify a number of recent versions to keep for all or certain data files? You can do so on GitHub.

any statistic that could tell how many files we have and how many versions we save in the repo?

Same here: DVC doesn’t have a tool to report this right now, although, like you noticed, it could be worked out by exploring the commit history of each DVC-file. That shouldn’t be too difficult! But feel free to open another feature request for that, or to continue this conversation by asking in the DVC chat. We’re pretty responsive over there.
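To make the "explore the commit history" idea concrete, here is a minimal Python sketch (not a DVC feature) that walks the Git history of one DVC-file and collects the distinct output hashes it has pointed to. It assumes a DVC-file created by dvc add, where each output is recorded on an "md5:" line inside the outs list, and a path given relative to the repository root. The names output_hashes and tracked_versions are hypothetical.

```python
import re
import subprocess

# Matches "md5:" lines that belong to a list item or are indented, i.e. the
# entries under "outs:". A top-level "md5:" line (column 0) is skipped.
# Assumption: the DVC-file layout described in the lead-in.
MD5_LINE = re.compile(
    r"^(?:[ \t]*-[ \t]+|[ \t]+)md5:[ \t]*([0-9a-f]{32}(?:\.dir)?)[ \t]*$",
    re.MULTILINE,
)


def output_hashes(dvc_file_text):
    """Return the output hashes found in the text of one DVC-file."""
    return MD5_LINE.findall(dvc_file_text)


def tracked_versions(dvc_file, repo="."):
    """List distinct output hashes of a DVC-file across Git history, oldest first."""
    log = subprocess.run(
        ["git", "log", "--format=%H", "--", dvc_file],
        cwd=repo, check=True, capture_output=True, text=True,
    ).stdout
    versions = []
    for commit in reversed(log.split()):
        shown = subprocess.run(
            ["git", "show", f"{commit}:{dvc_file}"],
            cwd=repo, capture_output=True, text=True,
        )
        if shown.returncode != 0:
            # The file does not exist at this commit (e.g. it was deleted).
            continue
        for digest in output_hashes(shown.stdout):
            if digest not in versions:
                versions.append(digest)
    return versions
```

Calling len(tracked_versions("data.csv.dvc")) (a hypothetical file name) would then give the number of distinct versions recorded in the history, and each hash corresponds to one hashed filename in the cache or remote.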

Thanks

redonhk · April 29, 2020, 4:07am

Thanks, jorgeorpinel, for the reply and advice.

I’m referring to remote storage, e.g. Hadoop, Google Cloud, etc. And yes, gc seems to help only with the local cache. Without those tools, I’m a bit hesitant to provide a storage estimate for the remote server for costing.

I will send the request to the suggested location, many thanks.

jorgeorpinel · April 29, 2020, 4:23am

Sounds good. Please note, though, that dvc gc does work on remote storage when using the --cloud option, like I mentioned. It just doesn’t have the specific capabilities you need.

Please let us know if you have further questions here or in the chat. We look forward to seeing your feature request on GH!
