How to avoid data duplication between cache and workspace

I am using DVC on macOS. According to the documentation, DVC won’t store the same file twice (once in the workspace and once in the cache).

To keep files present in both locations without duplication, DVC can automatically create **file links** in the workspace that point to the cached data. By default it will attempt to use reflinks, if the file system supports them.
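One way to check whether reflinks work at all on a given filesystem, independent of DVC, is to try cloning a file directly. This is a rough sketch: `cp --reflink` is the GNU coreutils flag (Linux); on macOS/APFS the equivalent is `cp -c`:

```shell
# Rough check for reflink (copy-on-write clone) support, outside of DVC.
# GNU coreutils: --reflink=always errors out if the filesystem can't clone.
# On macOS/APFS, use `cp -c src clone` instead.
tmp=$(mktemp -d)
dd if=/dev/urandom of="$tmp/src" bs=1k count=64 2>/dev/null
if cp --reflink=always "$tmp/src" "$tmp/clone" 2>/dev/null; then
  echo "reflinks supported"
else
  echo "reflinks NOT supported on this filesystem"
fi
rm -rf "$tmp"
```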

However, in my case it keeps two copies of the same file, significantly increasing disk usage. For example, with a file of around 5 GB, the project takes up around 10 GB. Here is the output of the dvc version command. According to the docs, reflinks are supported on macOS, and the output below confirms it, so why isn’t it working for me? Any ideas on what I am missing here?

DVC version: 2.18.0 (brew)
---------------------------------
Platform: Python 3.10.6 on macOS-12.5.1-x86_64-i386-64bit
Supports:
        azure (adlfs = 2022.7.0, knack = 0.9.0, azure-identity = 1.10.0),
        gdrive (pydrive2 = 1.14.0),
        gs (gcsfs = 2022.7.1),
        webhdfs (fsspec = 2022.7.1),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2022.7.1, boto3 = 1.21.21),
        ssh (sshfs = 2022.6.0),
        oss (ossfs = 2021.8.0),
        webdav (webdav4 = 0.9.7),
        webdavs (webdav4 = 0.9.7)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s5s1
Caches: local
Remotes: s3
Workspace directory: apfs on /dev/disk1s5s1
Repo: dvc (subdir), git

I have the same issue using Ubuntu.

@drsb @lcs72
I am afraid the problem lies with the system rather than with DVC.
Let me explain.
Here is a test_links.sh script that should help you see what is going on:

#!/bin/bash

set -xu
pushd "$TMPDIR"

wsp=test_wspace
rep=test_repo

rm -rf "$wsp" && mkdir "$wsp" && pushd "$wsp"
main=$(pwd)

mkdir "$rep" && pushd "$rep"

# Create a 1 MiB file of random data.
dd if=/dev/urandom of=data bs=1M count=1

sleep 1  # make the timestamps of each link type distinguishable

cp data data_copy

sleep 1
python -c "from dvc.fs import system; system.symlink('data', 'data_symlink')"

sleep 1
python -c "from dvc.fs import system; system.hardlink('data', 'data_hardlink')"

sleep 1
python -c "from dvc.fs import system; system.reflink('data', 'data_reflink')"

# Compare inodes and timestamps across the link types (BSD/macOS stat).
stat -f "inode: %i access: %a modification: %m changed: %c birth: %B %N" data data_copy data_symlink data_hardlink data_reflink

du -hd1
ls -alh

The result:

+ stat -f 'inode: %i access: %a modification: %m changed: %c birth: %B %N' data data_copy data_symlink data_hardlink data_reflink
inode: 139026482 access: 1663847851 modification: 1663847851 changed: 1663847855 birth: 1663847851 data
inode: 139026485 access: 1663847852 modification: 1663847852 changed: 1663847852 birth: 1663847852 data_copy
inode: 139026488 access: 1663847853 modification: 1663847853 changed: 1663847853 birth: 1663847853 data_symlink
inode: 139026482 access: 1663847851 modification: 1663847851 changed: 1663847855 birth: 1663847851 data_hardlink
inode: 139026495 access: 1663847851 modification: 1663847851 changed: 1663847856 birth: 1663847851 data_reflink
+ du -hd1
3.0M    .
+ ls -alh
total 8192
drwxr-xr-x  7 pawelredzynski  staff   224B Sep 22 13:57 .
drwxr-xr-x  3 pawelredzynski  staff    96B Sep 22 13:57 ..
-rw-r--r--  2 pawelredzynski  staff   1.0M Sep 22 13:57 data
-rw-r--r--  1 pawelredzynski  staff   1.0M Sep 22 13:57 data_copy
-rw-r--r--  2 pawelredzynski  staff   1.0M Sep 22 13:57 data_hardlink
-rw-r--r--  1 pawelredzynski  staff   1.0M Sep 22 13:57 data_reflink
lrwxr-xr-x  1 pawelredzynski  staff     4B Sep 22 13:57 data_symlink -> data

We create a 1 MB file and try out the different link types.
As you can see, each of them was created in a different second (that’s why the sleep calls are there).
Let’s focus on the reflink: looking at stat, only its inode and change time differ from data. Its birth time predates the moment we actually created the reflink, precisely because it is a reflink. The problem you are facing is that the system has no simple way of summarizing shared blocks, hence du -hd1 reports 3 MB: 1 MB each for data, data_copy, and data_reflink (in ls -alh you can see the symlink takes only 4 B). If you want to confirm that the hardlink is not counted, comment out its creation: the du result will not change.

To summarize: if you want your system to report the space saving, you need to go for hardlinks or symlinks. To make sure DVC tries all link types before falling back to a copy, you can run dvc config cache.type "reflink,symlink,hardlink,copy".
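Concretely, that looks like the snippet below. The `--relink` flag for dvc checkout re-creates existing workspace files from the cache with the new link strategy (run inside a DVC repo):

```shell
# Prefer link types in order; copy is the last resort.
dvc config cache.type "reflink,symlink,hardlink,copy"
# Re-create workspace files from the cache using the new link strategy.
dvc checkout --relink
```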


Thank you for this command. I ran it, removed the files in my workspace, and then dvc checkout brought them back as symlinks.

I expected something like this to happen by default (on Amazon Linux[1]), and was surprised dvc was using 2x disk space (and not linking in some fashion by default).

Here’s a good docs link: Configuring DVC cache file link type; see also dvc config cache.type.

Maybe dvc should emit a warning when copy results in a large amount of duplication (perhaps as a % of current disk size), and a heads-up that this parameter exists?

[1]: AMI ami-007855ac798b5175e: ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20230325

Sorry, that’s out of scope for us. One could easily just check dvc doctor and see which link types are available.