I am using DVC on macOS. According to the documentation, DVC won’t save the same file twice (one in the workspace and one in the cache).
In order to have the files present in both directories without duplication, DVC can automatically create **file links** to the cached data in the workspace. In fact, by default it will attempt to use reflinks if supported by the file system.
However, for me, it keeps two copies of the same file, which increases disk usage significantly. For example, if I have a file of around 5GB, the disk space used by that project goes to around 10GB. According to the docs, reflinks are supported on macOS, and the dvc version output lists them, so why isn't this working in my case? Any ideas on what I am missing here? Here is the output of the dvc version command:
DVC version: 2.18.0 (brew)
Platform: Python 3.10.6 on macOS-12.5.1-x86_64-i386-64bit
azure (adlfs = 2022.7.0, knack = 0.9.0, azure-identity = 1.10.0),
gdrive (pydrive2 = 1.14.0),
gs (gcsfs = 2022.7.1),
webhdfs (fsspec = 2022.7.1),
http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
s3 (s3fs = 2022.7.1, boto3 = 1.21.21),
ssh (sshfs = 2022.6.0),
oss (ossfs = 2021.8.0),
webdav (webdav4 = 0.9.7),
webdavs (webdav4 = 0.9.7)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s5s1
Workspace directory: apfs on /dev/disk1s5s1
Repo: dvc (subdir), git
I have the same issue using Ubuntu.
I am afraid that the problem is with the system rather than with DVC.
Let me explain. Here is a test_links.sh script that will help you understand:
set -x
wsp=test_wspace; rep=test_repo  # placeholder names; the original values were not shown
rm -rf $wsp && mkdir $wsp && pushd $wsp
mkdir $rep && pushd $rep
dd if=/dev/urandom of=data bs=1M count=1
sleep 1  # keep each step in a different second so the timestamps differ
cp data data_copy
sleep 1
python -c "from dvc.fs import system;system.symlink('data', 'data_symlink')"
sleep 1
python -c "from dvc.fs import system;system.hardlink('data', 'data_hardlink')"
sleep 1
python -c "from dvc.fs import system;system.reflink('data', 'data_reflink')"
stat -f "inode: %i access: %a modification: %m changed: %c birth: %B %N" data data_copy data_symlink data_hardlink data_reflink
du -hd1
ls -alh
+ stat -f 'inode: %i access: %a modification: %m changed: %c birth: %B %N' data data_copy data_symlink data_hardlink data_reflink
inode: 139026482 access: 1663847851 modification: 1663847851 changed: 1663847855 birth: 1663847851 data
inode: 139026485 access: 1663847852 modification: 1663847852 changed: 1663847852 birth: 1663847852 data_copy
inode: 139026488 access: 1663847853 modification: 1663847853 changed: 1663847853 birth: 1663847853 data_symlink
inode: 139026482 access: 1663847851 modification: 1663847851 changed: 1663847855 birth: 1663847851 data_hardlink
inode: 139026495 access: 1663847851 modification: 1663847851 changed: 1663847856 birth: 1663847851 data_reflink
+ du -hd1
+ ls -alh
drwxr-xr-x 7 pawelredzynski staff 224B Sep 22 13:57 .
drwxr-xr-x 3 pawelredzynski staff 96B Sep 22 13:57 ..
-rw-r--r-- 2 pawelredzynski staff 1.0M Sep 22 13:57 data
-rw-r--r-- 1 pawelredzynski staff 1.0M Sep 22 13:57 data_copy
-rw-r--r-- 2 pawelredzynski staff 1.0M Sep 22 13:57 data_hardlink
-rw-r--r-- 1 pawelredzynski staff 1.0M Sep 22 13:57 data_reflink
lrwxr-xr-x 1 pawelredzynski staff 4B Sep 22 13:57 data_symlink -> data
We create a 1 MB file and try out different link types.
As we can see, each of them was created in a different second (that's why I put the sleep calls there).
Let's focus on the reflink. Looking at stat, only its inode and changed time differ from those of data. Its birth time is from before we actually created the reflink, and that is because it is a reflink. The problem you are facing is that the system doesn't have a simple way of summarizing that. Hence du -hd1 counts 1 MB for each of data, data_copy and data_reflink (in ls -alh you can see that the symlink takes only 4B). If you want to check that the hardlink does not count, just comment out its creation: the du result will not change.
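The accounting described above can be illustrated with a minimal Python sketch. It assumes that a du-style tool adds up sizes while skipping inodes it has already seen; the helper name du_like is hypothetical, it sums st_size rather than allocated blocks, and it is not how du itself is implemented. No reflink is created here because the standard library has no portable call for it, but since a reflink gets its own inode (as the stat output shows), it would be counted just like data_copy.

```python
import os
import shutil
import tempfile


def du_like(directory):
    """Sum file sizes once per inode (hypothetical helper, not the real du)."""
    seen = set()
    total = 0
    for entry in os.scandir(directory):
        st = entry.stat(follow_symlinks=False)  # lstat: do not follow symlinks
        key = (st.st_dev, st.st_ino)
        if key in seen:
            continue  # a hardlink to a file that was already counted
        seen.add(key)
        total += st.st_size
    return total


workdir = tempfile.mkdtemp()
data = os.path.join(workdir, "data")
with open(data, "wb") as f:
    f.write(os.urandom(1024 * 1024))  # 1 MB of random data, like the dd call

shutil.copy(data, os.path.join(workdir, "data_copy"))      # separate inode
os.link(data, os.path.join(workdir, "data_hardlink"))      # same inode as data
os.symlink("data", os.path.join(workdir, "data_symlink"))  # tiny link entry

total = du_like(workdir)
print(f"{total / (1024 * 1024):.2f} MB")
```

The total covers data and data_copy plus the few bytes of the symlink; the hardlink adds nothing because it shares an inode with data.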
To summarize: if you want your system to pick up the fact that you are using links, you would need to go for hardlinks or symlinks. To make sure that DVC tries every link type before falling back to a copy, you can run dvc config cache.type "reflink,symlink,hardlink,copy".
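To check which kind of link ended up in the workspace, a sketch like the following could compare a workspace file with its counterpart in the cache. The function name link_kind and the file names are hypothetical and not part of DVC. Note that stat information alone cannot tell a reflink from a plain copy, which is the same limitation discussed above.

```python
import os
import shutil
import tempfile


def link_kind(workspace_path, cache_path):
    """Classify how workspace_path relates to cache_path (hypothetical helper)."""
    if os.path.islink(workspace_path):
        return "symlink"
    ws = os.stat(workspace_path)
    cache = os.stat(cache_path)
    if (ws.st_dev, ws.st_ino) == (cache.st_dev, cache.st_ino):
        return "hardlink"  # same inode on the same device
    # A reflink has its own inode, so it looks like a copy from here.
    return "copy or reflink"


# Small demonstration with throwaway files standing in for cache and workspace.
workdir = tempfile.mkdtemp()
cache_file = os.path.join(workdir, "cache_file")
with open(cache_file, "wb") as f:
    f.write(b"example")

hard = os.path.join(workdir, "ws_hardlink")
sym = os.path.join(workdir, "ws_symlink")
copy = os.path.join(workdir, "ws_copy")
os.link(cache_file, hard)
os.symlink(cache_file, sym)
shutil.copy(cache_file, copy)

for path in (hard, sym, copy):
    print(os.path.basename(path), "->", link_kind(path, cache_file))
```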