Redundant Data across versions and within versions

tonylib · August 7, 2018, 3:46am

I have data projects where very large files change by only a few characters with each new version. What disk usage should users expect?

Also, there are often cases where we have many files in a directory that look very similar. Are there optimizations by which chunks of data within each file are hashed and de-duplicated?

Thanks

kupruser · August 7, 2018, 1:25pm

Hi @tonylib !

Currently dvc doesn’t do anything to the user’s data except deduplicate identical files. The advantage is that operating on full files is really easy, which keeps the dvc architecture simple when it comes to cache files. The downside is that, in your scenario, you will indeed be storing a full copy of the file for each version. If you don’t really need older versions on your local machine, you can use our dvc push command to push each version to the cloud and then use the dvc gc command to remove local cache that is not used in the current HEAD of your project, but you will lose the ability to go back through your data versions without pulling the data first.
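To make the disk-usage expectation concrete, here is a minimal sketch of whole-file deduplication. It is not dvc's actual code: the in-memory dict standing in for the cache, and the helper names `file_key`, `store`, and `cache_size`, are assumptions made for illustration only.

```python
import hashlib


def file_key(data: bytes) -> str:
    """Key a file by the hash of its entire content."""
    return hashlib.md5(data).hexdigest()


def store(cache: dict, data: bytes) -> str:
    """Add a file to the cache; identical content is kept only once."""
    key = file_key(data)
    cache.setdefault(key, data)
    return key


def cache_size(cache: dict) -> int:
    """Total number of bytes held by the cache."""
    return sum(len(blob) for blob in cache.values())


cache = {}  # hypothetical stand-in for an on-disk cache directory
v1 = b"a" * 1_000_000
v2 = v1[:-3] + b"xyz"  # same file with only the last three bytes changed

store(cache, v1)
store(cache, v1)  # identical copy: takes no extra space
size_after_duplicates = cache_size(cache)

store(cache, v2)  # near-identical version: stored in full again
size_after_new_version = cache_size(cache)
```

In this sketch the exact duplicate costs nothing, while the three-byte change doubles the stored size, which matches the behaviour described above: one full file per version.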

The same goes for files in a directory: we only manage full files and don’t do chunk deduplication yet. That being said, we are considering adding that feature and will be sure to implement it in the future. Here is a GitHub issue to track the progress on that: https://github.com/iterative/dvc/issues/829 .
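For readers unfamiliar with the idea, chunk deduplication can be sketched roughly as follows. This illustrates the general technique asked about in the question, not anything dvc implements; the fixed chunk size, the use of SHA-256, and the names `split_chunks`, `store_chunked`, and `restore` are all assumptions.

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # hypothetical fixed chunk size in bytes


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Cut the data into consecutive fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def store_chunked(chunk_store: dict, data: bytes) -> list:
    """Store chunks not seen before; return the file's list of chunk hashes."""
    recipe = []
    for chunk in split_chunks(data):
        key = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(key, chunk)
        recipe.append(key)
    return recipe


def restore(chunk_store: dict, recipe: list) -> bytes:
    """Rebuild a file from its list of chunk hashes."""
    return b"".join(chunk_store[key] for key in recipe)


def stored_bytes(chunk_store: dict) -> int:
    """Total number of bytes held by the chunk store."""
    return sum(len(chunk) for chunk in chunk_store.values())


# Deterministic 1 MiB test file in which every chunk is distinct.
v1 = b"".join(i.to_bytes(4, "big") for i in range(262_144))

# Second version: three bytes overwritten in place, length unchanged.
edited = bytearray(v1)
edited[500_000:500_003] = b"xyz"
v2 = bytes(edited)

chunk_store = {}
recipe_v1 = store_chunked(chunk_store, v1)
stored_after_v1 = stored_bytes(chunk_store)
recipe_v2 = store_chunked(chunk_store, v2)
stored_after_v2 = stored_bytes(chunk_store)
```

Here the second version adds only the one chunk that contains the edit instead of another full copy. Fixed-size chunks only help with in-place edits like this one; inserting or deleting bytes shifts every later chunk boundary, which is why such schemes often use content-defined chunk boundaries instead.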

Thanks,
Ruslan

rmbzmb · July 22, 2022, 3:20pm

Is there a way to automatically remove files which are not checked out? We have a similar issue. Using DVC on an HPC cluster with multiple users would quadruple the space needed to store all the duplicates. Those files would then potentially go to a backup and use even more disk space down the line. We can probably exclude the cache directories from being backed up.

dtrifiro · July 23, 2022, 3:16pm

Hey @rmbzmb,
You might want to check out a shared-cache setup: How to Share a Cache Among Projects

If possible, please try to avoid bumping 4-year-old threads, thanks!

AquaSolu · February 2, 2023, 9:57pm

Thank you for posting something like this.
