How to avoid data duplication between cache and workspace

I am using DVC on macOS. According to the documentation, DVC won’t store the same file twice (once in the workspace and once in the cache).

To keep files present in both locations without duplication, DVC can automatically create **file links** in the workspace that point to the cached data. By default it will attempt to use reflinks, if the file system supports them.
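One way to check whether reflinks work at all on a given filesystem, independent of DVC, is to try cloning a file directly. This is a rough sketch: `cp --reflink` is the GNU coreutils flag (Linux); on macOS/APFS the equivalent is `cp -c`:

```shell
# Rough check for reflink (copy-on-write clone) support, outside of DVC.
# GNU coreutils: --reflink=always errors out if the filesystem can't clone.
# On macOS/APFS, use `cp -c src clone` instead.
tmp=$(mktemp -d)
dd if=/dev/urandom of="$tmp/src" bs=1k count=64 2>/dev/null
if cp --reflink=always "$tmp/src" "$tmp/clone" 2>/dev/null; then
  echo "reflinks supported"
else
  echo "reflinks NOT supported on this filesystem"
fi
rm -rf "$tmp"
```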

However, in my case it keeps two copies of the same file, significantly increasing disk usage. For example, with a file of around 5 GB, the project takes up around 10 GB. Here is the output of the dvc version command. According to the docs, reflinks are supported on macOS, and the output below confirms it, so why isn’t it working for me? Any ideas on what I am missing here?

DVC version: 2.18.0 (brew)
---------------------------------
Platform: Python 3.10.6 on macOS-12.5.1-x86_64-i386-64bit
Supports:
        azure (adlfs = 2022.7.0, knack = 0.9.0, azure-identity = 1.10.0),
        gdrive (pydrive2 = 1.14.0),
        gs (gcsfs = 2022.7.1),
        webhdfs (fsspec = 2022.7.1),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2022.7.1, boto3 = 1.21.21),
        ssh (sshfs = 2022.6.0),
        oss (ossfs = 2021.8.0),
        webdav (webdav4 = 0.9.7),
        webdavs (webdav4 = 0.9.7)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s5s1
Caches: local
Remotes: s3
Workspace directory: apfs on /dev/disk1s5s1
Repo: dvc (subdir), git

I have the same issue using Ubuntu.

@drsb @lcs72
I am afraid the problem lies with the system rather than with DVC.
Let me explain.
Here is a test_links.sh script that should help you see what is going on:

#!/bin/bash

set -xu
pushd "$TMPDIR"

wsp=test_wspace
rep=test_repo

rm -rf "$wsp" && mkdir "$wsp" && pushd "$wsp"
main=$(pwd)

mkdir "$rep" && pushd "$rep"

# Create a 1 MiB file of random data.
dd if=/dev/urandom of=data bs=1M count=1

sleep 1  # make the timestamps of each link type distinguishable

cp data data_copy

sleep 1
python -c "from dvc.fs import system; system.symlink('data', 'data_symlink')"

sleep 1
python -c "from dvc.fs import system; system.hardlink('data', 'data_hardlink')"

sleep 1
python -c "from dvc.fs import system; system.reflink('data', 'data_reflink')"

# Compare inodes and timestamps across the link types (BSD/macOS stat).
stat -f "inode: %i access: %a modification: %m changed: %c birth: %B %N" data data_copy data_symlink data_hardlink data_reflink

du -hd1
ls -alh

The result:

+ stat -f 'inode: %i access: %a modification: %m changed: %c birth: %B %N' data data_copy data_symlink data_hardlink data_reflink
inode: 139026482 access: 1663847851 modification: 1663847851 changed: 1663847855 birth: 1663847851 data
inode: 139026485 access: 1663847852 modification: 1663847852 changed: 1663847852 birth: 1663847852 data_copy
inode: 139026488 access: 1663847853 modification: 1663847853 changed: 1663847853 birth: 1663847853 data_symlink
inode: 139026482 access: 1663847851 modification: 1663847851 changed: 1663847855 birth: 1663847851 data_hardlink
inode: 139026495 access: 1663847851 modification: 1663847851 changed: 1663847856 birth: 1663847851 data_reflink
+ du -hd1
3.0M    .
+ ls -alh
total 8192
drwxr-xr-x  7 pawelredzynski  staff   224B Sep 22 13:57 .
drwxr-xr-x  3 pawelredzynski  staff    96B Sep 22 13:57 ..
-rw-r--r--  2 pawelredzynski  staff   1.0M Sep 22 13:57 data
-rw-r--r--  1 pawelredzynski  staff   1.0M Sep 22 13:57 data_copy
-rw-r--r--  2 pawelredzynski  staff   1.0M Sep 22 13:57 data_hardlink
-rw-r--r--  1 pawelredzynski  staff   1.0M Sep 22 13:57 data_reflink
lrwxr-xr-x  1 pawelredzynski  staff     4B Sep 22 13:57 data_symlink -> data

We create a 1 MB file and try out the different link types.
As you can see, each of them was created in a different second (that’s why the sleep calls are there).
Let’s focus on the reflink: looking at stat, only its inode and change time differ from data. Its birth time predates the moment we actually created the reflink, precisely because it is a reflink. The problem you are facing is that the system has no simple way of summarizing shared blocks, hence du -hd1 reports 3 MB: 1 MB each for data, data_copy, and data_reflink (in ls -alh you can see the symlink takes only 4 B). If you want to confirm that the hardlink is not counted, comment out its creation: the du result will not change.

To summarize: if you want your system to report the space saving, you need to go for hardlinks or symlinks. To make sure DVC tries all link types before falling back to a copy, you can run dvc config cache.type "reflink,symlink,hardlink,copy".
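Concretely, that looks like the snippet below. The `--relink` flag for dvc checkout re-creates existing workspace files from the cache with the new link strategy (run inside a DVC repo):

```shell
# Prefer link types in order; copy is the last resort.
dvc config cache.type "reflink,symlink,hardlink,copy"
# Re-create workspace files from the cache using the new link strategy.
dvc checkout --relink
```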


Thank you for this command. I ran it, removed the files in my workspace, and then dvc checkout brought them back as symlinks.

I expected something like this to happen by default (on Amazon Linux[1]), and was surprised dvc was using 2x disk space (and not linking in some fashion by default).

Here’s a good docs link: Configuring DVC cache file link type; see also dvc config cache.type.

Maybe dvc should emit a warning when copy results in a large amount of duplication (perhaps as a % of current disk size), and a heads-up that this parameter exists?

[1]: AMI ami-007855ac798b5175e: ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20230325

Sorry, that’s out of scope for us. One could easily just check dvc doctor and see which link types are available.