Shared cache with huge repo

Hi all,

We’re currently evaluating DVC for a rather complex use case, and would appreciate any insights or recommendations.

Our context:

We’re working with a large repository (~2 TB) that includes a huge number of files — over 1000 — ranging in size from 1MB to 100GB, organized like:

db/datasets/ds1
db/mixtures/mx1
...

The repo is used by multiple developers (mostly on Windows) and also by automated processes (mostly Linux) — scheduled tasks, CI pipelines, etc.

Up until now, we’ve been using SVN, which handled this somehow (not great, but it worked). We’re now migrating to DVC for better versioning, history, and overall developer experience.


Our DVC setup:

  • Data storage: S3-compatible (MinIO)

  • DVC remote: s3://datasets/dvc

  • Cache: shared cache hosted on one Linux server (XFS + noatime)

  • Access methods we’ve tried: NFS, SMB, and rclone-mounted S3

  • Cache config:

    [core]
        analytics = false
        remote = s3
    [cache]
        dir = Y:/
        shared = group
        type = symlink
    [remote "s3"]
        url = s3://datasets/dvc
        endpointurl = https://s3.somedomain.com
    

The problems we’re facing:

  1. Cache performance is very poor — especially over NFS. Interaction with DVC (e.g. checkout, pull) is painfully slow.

  2. Windows developers are hitting errors like:

    File "C:\Python313\Lib\site-packages\dvc_data\hashfile\db\local.py", line 117, in protect
      os.chmod(path, self.CACHE_MODE)
    OSError: [WinError 6] The handle is invalid: 'Y:/files\\md5\\ec\\352218c5b3676a3cd594034188759f'
    

    My (possibly naive) intuition is that DVC is trying to “protect” the cache file with Windows-specific file permissions, which fails on a Linux-hosted SMB/NFS share. Not sure if that’s the real cause.
    I’m aware of the unprotect option, but it doesn’t seem like a good idea if we want to keep the data consistent.


What we’re trying to achieve:

  • Efficient shared cache across platforms
  • Reliable versioning and history tracking
  • Good user experience for developers (especially on Windows)
  • Minimal duplication of cache (disk space is a concern)

Any advice or best practices would be greatly appreciated — particularly:

  • Is it realistic to share DVC cache across platforms via SMB/NFS?
  • Any recommendations for Windows users to avoid the WinError 6 issue?
  • Overall, I have a feeling we are doing something completely wrong. What does the actual workflow for distributed teams look like in real life?

Technical details

dvc doctor output from the Windows client

DVC version: 3.60.0 (choco)
---------------------------
Platform: Python 3.13.5 on Windows-2022Server-10.0.20348-SP0
Subprojects:
        dvc_data = 3.16.10
        dvc_objects = 5.1.1
        dvc_render = 1.0.2
        dvc_task = 0.40.2
        scmrepo = 3.3.11
Supports:
        azure (adlfs = 2024.12.0, knack = 0.12.0, azure-identity = 1.23.0),
        gdrive (pydrive2 = 1.21.3),
        gs (gcsfs = 2025.5.1),
        http (aiohttp = 3.12.13, aiohttp-retry = 2.9.1),
        https (aiohttp = 3.12.13, aiohttp-retry = 2.9.1),
        oss (ossfs = 2025.5.0),
        s3 (s3fs = 2025.5.1, boto3 = 1.38.27),
        ssh (sshfs = 2025.2.0)
Config:
        Global: C:\Users\gitlab-runner\AppData\Local\iterative\dvc
        System: C:\ProgramData\iterative\dvc
Cache types: symlink
Cache directory: NTFS on Y:\
Caches: local
Remotes: s3
Workspace directory: NTFS on E:\
Repo: dvc, git

dvc doctor output from the Linux machine where we are testing the solution

DVC version: 3.60.1 (deb)
-------------------------
Platform: Python 3.12.11 on Linux-6.1.0-37-amd64-x86_64-with-glibc2.36
Subprojects:

Supports:
	azure (adlfs = 2024.12.0, knack = 0.12.0, azure-identity = 1.23.0),
	gdrive (pydrive2 = 1.21.3),
	gs (gcsfs = 2025.5.1),
	hdfs (fsspec = 2025.5.1, pyarrow = 20.0.0),
	http (aiohttp = 3.12.12, aiohttp-retry = 2.9.1),
	https (aiohttp = 3.12.12, aiohttp-retry = 2.9.1),
	oss (ossfs = 2025.5.0),
	s3 (s3fs = 2025.5.1, boto3 = 1.37.3),
	ssh (sshfs = 2025.2.0),
	webdav (webdav4 = 0.10.0),
	webdavs (webdav4 = 0.10.0),
	webhdfs (fsspec = 2025.5.1)
Config:
	Global: /root/.config/dvc
	System: /etc/xdg/dvc

Thanks in advance!

Hi @wilderone.

checkout involves copying (or linking) files from the cache to the workspace. pull is fetch + checkout, so it first downloads files to the cache and then checks them out to the workspace from the cache.
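
For example, the following two are roughly equivalent (just a sketch; whether checkout copies or links files depends on your cache.type setting):

# one step: download missing objects from the remote into the cache,
# then link/copy them into the workspace
dvc pull

# the same thing as two explicit steps
dvc fetch
dvc checkout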

Since your cache is on NFS, these operations are going to be slower. DVC is also not very well optimized for NFS: we do a lot of stat calls on the objects (i.e. the files) in the cache, which is painfully slow over NFS.

If you can provide profiling data, I can take a look and see if we can do something.

You can add --cprofile-dump <filename> to any dvc command to generate profiling data.

E.g.:

dvc push --cprofile-dump push.prof
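
If you want a quick look at the profile yourself, the dump is a standard cProfile file, so Python's built-in pstats browser can read it (assuming the push.prof filename from above):

python -m pstats push.prof
# then, at the pstats prompt:
#   sort cumulative
#   stats 20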

Regarding the WinError 6 traceback: that’s not an error. It comes from a debug message that only shows up when you run dvc in --verbose/-v mode. You can ignore it.

Yes, many of our users use a shared cache. For text files on Windows, make sure they are saved with LF (Unix-style) line endings. This won’t corrupt the cache, but saving a file from Windows with CRLF endings still modifies it (and therefore changes its hash).

With DVC, you can work independently in your Git branches with a shared cache/remote (across different platforms and different projects). The state of your Git branch is what determines the version of your tracked files. On merges, there may be conflicts that need to be resolved.
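
In practice, switching data versions looks like this (the branch name is just an example):

# switch Git branches, then sync the DVC-tracked data to match
git checkout feature/new-mixture
dvc checkout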

That said, I’d suggest avoiding a shared cache if possible and using a local cache on each dev machine, pushing to and pulling from the remote directly.
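
One way to do that (a sketch; the paths are examples) is to override the shared cache location in each machine's local, uncommitted config:

# .dvc/config.local overrides .dvc/config and is not committed to Git
# on a Windows dev box:
dvc config --local cache.dir D:\dvc-cache
# on a Linux runner:
dvc config --local cache.dir /var/cache/dvc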

With DVC, you can pull partial data and work with it. You can also add or update datasets without having to download everything.

See Modifying Large Datasets.
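
For example, with granular targets you can pull, modify, and push a single dataset without touching the rest of the 2 TB (the paths follow your layout; this sketch assumes ds1 is tracked by its own .dvc file):

# download and check out only one dataset
dvc pull db/datasets/ds1
# ...modify the data...
dvc add db/datasets/ds1
dvc push db/datasets/ds1
git add db/datasets/ds1.dvc
git commit -m "Update ds1"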

If you have more questions, please don’t hesitate to ask. :slightly_smiling_face:

One more thing:
We also have an enterprise tool for managing and versioning datasets at scale, that might suit your workflows. If you are interested, please send us an email at support@dvc.org.