I have data projects where very large files change by only a few characters with each new version. What disk usage should users expect?
Also, there are often cases where we have many files in a directory that look very similar. Are there optimizations by which chunks of data within each file are hashed and de-duplicated?
Currently DVC doesn't do anything to the user's data except deduplicate identical files. The advantage is that operating on whole files is really easy and keeps the DVC cache architecture very simple. The downside is that in your scenario you will indeed store a full copy of each file for every version. If you don't really need older versions on your local machine, you can use dvc push to send each version to the cloud and then dvc gc to remove local cache that is not used in the current HEAD of your project, but you will lose the ability to go back to older data versions without pulling the data first (see the sketch below).
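A minimal sketch of that workflow. The remote name (myremote) and the S3 URL are placeholders, and depending on your DVC version dvc gc may require an explicit scope flag such as --workspace, so check dvc gc --help before running it:

```bash
# Configure a remote once (the name "myremote" and the URL are placeholders)
dvc remote add -d myremote s3://my-bucket/dvc-storage

# After each new version is tracked with dvc add and committed to git:
dvc push            # upload the current cache to the remote

# Drop local cache objects not referenced by the current workspace.
# Newer DVC versions require an explicit scope flag, e.g.:
dvc gc --workspace

# An older version can still be restored later with:
# git checkout <old-revision> && dvc pull
```

The tradeoff is exactly as described above: after dvc gc, switching back to an older revision requires a dvc pull from the remote before the data is available locally.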
It's the same with files in a directory: we only manage whole files and don't do chunk-level deduplication yet. That being said, we are considering adding that feature in the future. Here is a GitHub issue to track the progress: https://github.com/iterative/dvc/issues/829
Is there a way to automatically remove files which are not checked out? We have a similar issue. Using DVC on an HPC cluster with multiple users would quadruple the space needed to store all the duplicates. Those files would then potentially go to a backup and use even more disk down the line. We can probably exclude the cache directories from being backed up.