I am using DVC on macOS. According to the documentation, DVC won’t save the same file twice (one in the workspace and one in the cache).
In order to have the files present in both directories without duplication, DVC can automatically create **file links** to the cached data in the workspace. In fact, by default it will attempt to use reflinks if supported by the file system.
However, for me, it keeps two copies of the same file, which increases disk usage significantly. For example, if I have a file of around 5GB, the disk space used by that project goes to around 10GB. Here is the output of the dvc version command. According to the docs, reflinks are supported on macOS, and the dvc version output shows this as well, so why isn’t it working in my case? Any ideas on what I am missing here?
@drsb@lcs72
I am afraid that the problem is with the system rather than with DVC.
Let me explain.
Here is a test_links.sh script that will help you understand:
We create a 1MB file and try out different link types.
As we can see, each of them was created in a different second (that’s why I put sleep there).
Let’s focus on the reflink. Looking at stat, only its inode and change time differ from those of data; its birth time is from before we actually created the reflink, because it is a reflink. The problem you are facing is that the system doesn’t have a simple way of summarizing that, so du -hd1 reports 3MB: 1MB each for data, data_copy and data_reflink (in ls -alh you can see that the symlink takes only 4B). If you want to check that the hardlink does not count, just comment out its creation; the du result will not change.
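For anyone who wants to try this themselves, the experiment can be sketched roughly as follows. This is a hypothetical reconstruction, not the original test_links.sh. The file names follow the ones mentioned above, and the reflink step is best-effort: I am assuming `cp -c` on macOS and `cp --reflink=always` with GNU coreutils, and the script simply skips the reflink if the file system refuses it.

```shell
#!/usr/bin/env bash
# Hypothetical reconstruction; NOT the original test_links.sh from this thread.
set -eu

workdir="$(mktemp -d)"   # scratch directory so nothing in your project is touched
cd "$workdir"

# 1 MB of random data to link against.
dd if=/dev/urandom of=data bs=1024 count=1024 2>/dev/null
sleep 1                            # sleeps keep the timestamps distinguishable

cp data data_copy;       sleep 1   # independent full copy
ln data data_hardlink;   sleep 1   # second name for the same inode
ln -s data data_symlink; sleep 1   # tiny file that only stores the path "data"

# Reflink (copy-on-write clone). The flag differs per platform (assumption:
# `cp -c` on macOS, `cp --reflink=always` on GNU coreutils), and the call
# fails on file systems without clone support, so it is treated as optional.
if [ "$(uname)" = "Darwin" ]; then
    reflink_cmd="cp -c"
else
    reflink_cmd="cp --reflink=always"
fi
if $reflink_cmd data data_reflink 2>/dev/null; then
    reflink_status="created"
else
    rm -f data_reflink             # clean up any partial file
    reflink_status="unsupported"
fi
echo "reflink: $reflink_status"

ls -li      # first column is the inode: data and data_hardlink share one
du -hd1 .   # per-directory total, to compare against the per-file sizes above
```

Comparing the inode column from ls -li with the du total should show the effect described above: the hardlink adds nothing to the total, while a copy (and, where supported, a reflink) is reported at full size.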
To summarize: if you want your system to pick up the fact that you are using links, you would need to go for hardlinks or symlinks. To make sure that DVC tries all link types before falling back to copy, you can run dvc config cache.type "reflink,symlink,hardlink,copy".
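Put together, applying that setting to an existing project might look like the sketch below. The cache.type value is the one from this post. The `dvc checkout --relink` step is my assumption about the most convenient way to re-create the workspace files with the new link type, and the guard at the top is only there so the snippet does nothing outside a DVC project.

```shell
#!/usr/bin/env bash
# Sketch only: run from the root of an existing DVC project.
set -eu

relinked="no"
if command -v dvc >/dev/null 2>&1 && [ -d .dvc ]; then
    # Link types are tried left to right; copy is the last resort.
    dvc config cache.type "reflink,symlink,hardlink,copy"
    # Assumption: --relink re-creates workspace files using the new link type.
    dvc checkout --relink
    relinked="yes"
else
    echo "Not inside a DVC project (or dvc is not installed); skipping."
fi
echo "relinked: $relinked"
```

Afterwards, ls -l on a tracked file should show whether it is now a symlink pointing into the cache.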
Thank you for this command. I ran it, rm’d the files in my workspace, then dvc checkout brought them back as symlinks.
I expected something like this to happen by default (on Amazon Linux[1]), and was surprised dvc was using 2x disk space (and not linking in some fashion by default).
Maybe dvc should emit a warning when copy results in a large amount of duplication (perhaps as a % of current disk size), and a heads-up that this parameter exists?
[1]: AMI ami-007855ac798b5175e: ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20230325