Hi @tonylib !
Currently dvc doesn't do anything to your data, except deduplicating identical files. The advantage is that operating on full files is really easy, which keeps the dvc architecture simple when it comes to cache files. The downside is that, in your scenario, you will indeed be storing a full file for each version. If you don't really need older versions on your local machine, you can use our `dvc push` command to push each version to the cloud and then use the `dvc gc` command to remove local cache that is not used in the current HEAD of your project. Note that you will lose the ability to go back through your data versions without pulling the data first.
The same applies to files in a directory: we only manage full files and don't do chunk deduplication yet. That being said, we are considering that feature and plan to implement it in the future. Here is a GitHub issue to track the progress on that: https://github.com/iterative/dvc/issues/829
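
To illustrate what deduplicating only whole files means in practice, here is a minimal Python sketch. It is not dvc's actual code: the `cache` dict and the `cache_key` and `add_to_cache` helpers are hypothetical names made up for this example, and the hash is only there to show a content-addressed store keyed on the full file contents.

```python
import hashlib


def cache_key(data: bytes) -> str:
    # Hypothetical helper: the key depends on the entire file content.
    return hashlib.md5(data).hexdigest()


def add_to_cache(cache: dict, data: bytes) -> str:
    # Store the blob only if an identical one is not already cached.
    key = cache_key(data)
    cache.setdefault(key, data)
    return key


cache = {}
v1 = b"a" * 1_000_000   # first version of a file
v2 = v1 + b"b"          # second version: one byte appended

k1 = add_to_cache(cache, v1)
k1_again = add_to_cache(cache, v1)  # identical file, no new cache entry
k2 = add_to_cache(cache, v2)        # nearly identical, still a full new entry

stored_bytes = sum(len(blob) for blob in cache.values())
print(len(cache), stored_bytes)  # prints "2 2000001"
```

Adding the identical file twice costs nothing extra, but the one-byte change stores the whole file again. Chunk-level deduplication would be what avoids that second full copy.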