Shared cache with huge repo

Hi all,

We’re currently evaluating DVC for a rather complex use case, and would appreciate any insights or recommendations.

Our context:

We’re working with a large repository (~2 TB) that includes a huge number of files — over 1000 — ranging in size from 1MB to 100GB, organized like:

db/datasets/ds1
db/mixtures/mx1
...

The repo is used by multiple developers (mostly on Windows) and also by automated processes (mostly Linux) — scheduled tasks, CI pipelines, etc.

Up until now, we’ve been using SVN, which handled this somehow (not great, but it worked). We’re now migrating to DVC for better versioning, history, and overall developer experience.


Our DVC setup:

  • Data storage: S3-compatible (MinIO)

  • DVC remote: s3://datasets/dvc

  • Cache: shared cache hosted on one Linux server (XFS + noatime)

  • Access methods we’ve tried: NFS, SMB, and rclone-mounted S3

  • Cache config:

    [core]
        analytics = false
        remote = s3
    [cache]
        dir = Y:/
        shared = group
        type = symlink
    [remote "s3"]
        url = s3://datasets/dvc
        endpointurl = https://s3.somedomain.com
    

The problems we’re facing:

  1. Cache performance is very poor — especially over NFS. Interaction with DVC (e.g. checkout, pull) is painfully slow.

  2. Windows developers are hitting errors like:

    File "C:\Python313\Lib\site-packages\dvc_data\hashfile\db\local.py", line 117, in protect
      os.chmod(path, self.CACHE_MODE)
    OSError: [WinError 6] The handle is invalid: 'Y:/files\\md5\\ec\\352218c5b3676a3cd594034188759f'
    

    My (possibly naive) intuition is that DVC is trying to “protect” the cache file with Windows-specific file permissions, which fails on a Linux-hosted SMB/NFS share. Not sure if that’s the real cause.
    I’m aware of the unprotect option, but it doesn’t seem like a good idea if we want to keep the data consistent.


What we’re trying to achieve:

  • Efficient shared cache across platforms
  • Reliable versioning and history tracking
  • Good user experience for developers (especially on Windows)
  • Minimal duplication of cache (disk space is a concern)

Any advice or best practices would be greatly appreciated — particularly:

  • Is it realistic to share DVC cache across platforms via SMB/NFS?
  • Any recommendations for Windows users to avoid the WinError 6 issue?
  • Overall, I have a feeling we are doing something completely wrong. What does the actual workflow for distributed teams look like in real life?

Technical details

dvc doctor output from the Windows client

DVC version: 3.60.0 (choco)
---------------------------
Platform: Python 3.13.5 on Windows-2022Server-10.0.20348-SP0
Subprojects:
        dvc_data = 3.16.10
        dvc_objects = 5.1.1
        dvc_render = 1.0.2
        dvc_task = 0.40.2
        scmrepo = 3.3.11
Supports:
        azure (adlfs = 2024.12.0, knack = 0.12.0, azure-identity = 1.23.0),
        gdrive (pydrive2 = 1.21.3),
        gs (gcsfs = 2025.5.1),
        http (aiohttp = 3.12.13, aiohttp-retry = 2.9.1),
        https (aiohttp = 3.12.13, aiohttp-retry = 2.9.1),
        oss (ossfs = 2025.5.0),
        s3 (s3fs = 2025.5.1, boto3 = 1.38.27),
        ssh (sshfs = 2025.2.0)
Config:
        Global: C:\Users\gitlab-runner\AppData\Local\iterative\dvc
        System: C:\ProgramData\iterative\dvc
Cache types: symlink
Cache directory: NTFS on Y:\
Caches: local
Remotes: s3
Workspace directory: NTFS on E:\
Repo: dvc, git

dvc doctor output from the Linux machine where we are testing the solution

DVC version: 3.60.1 (deb)
-------------------------
Platform: Python 3.12.11 on Linux-6.1.0-37-amd64-x86_64-with-glibc2.36
Subprojects:

Supports:
	azure (adlfs = 2024.12.0, knack = 0.12.0, azure-identity = 1.23.0),
	gdrive (pydrive2 = 1.21.3),
	gs (gcsfs = 2025.5.1),
	hdfs (fsspec = 2025.5.1, pyarrow = 20.0.0),
	http (aiohttp = 3.12.12, aiohttp-retry = 2.9.1),
	https (aiohttp = 3.12.12, aiohttp-retry = 2.9.1),
	oss (ossfs = 2025.5.0),
	s3 (s3fs = 2025.5.1, boto3 = 1.37.3),
	ssh (sshfs = 2025.2.0),
	webdav (webdav4 = 0.10.0),
	webdavs (webdav4 = 0.10.0),
	webhdfs (fsspec = 2025.5.1)
Config:
	Global: /root/.config/dvc
	System: /etc/xdg/dvc

Thanks in advance!

Hi @wilderone.

checkout involves copying (or linking) files from the cache to the workspace. pull is fetch + checkout, so it first downloads files to the cache and then checks them out to the workspace from the cache.
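
For example, the following two are roughly equivalent (just a sketch; whether checkout copies or links files depends on your cache.type setting):

# one step: download missing objects from the remote into the cache,
# then link/copy them into the workspace
dvc pull

# the same thing as two explicit steps
dvc fetch
dvc checkout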

Since your cache is on NFS, these operations are going to be slower. DVC is also not very well optimized for NFS: we do a lot of stat calls on the objects (i.e. the files) in the cache, which is painfully slow over NFS.

If you can provide profiling data, I can take a look and see if we can do something.

You can add --cprofile-dump <filename> to any dvc command to generate profiling data.

E.g.:

dvc push --cprofile-dump push.prof
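
If you want a quick look at the profile yourself, the dump is a standard cProfile file, so Python's built-in pstats browser can read it (assuming the push.prof filename from above):

python -m pstats push.prof
# then, at the pstats prompt:
#   sort cumulative
#   stats 20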

Regarding the WinError 6 traceback: that’s not an error. It comes from a debug message that only shows up when you run dvc in --verbose/-v mode. You can ignore it.

Yes, many of our users use a shared cache. For text files on Windows, make sure they are saved with LF (Unix-style) line endings. This won’t corrupt the cache, but saving a file from Windows with CRLF endings still modifies it (and therefore changes its hash).

With DVC, you can work independently in your Git branches with a shared cache/remote (across different platforms and different projects). The state of your Git branch is what determines the version of your tracked files. On merges, there may be conflicts that need to be resolved.
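
In practice, switching data versions looks like this (the branch name is just an example):

# switch Git branches, then sync the DVC-tracked data to match
git checkout feature/new-mixture
dvc checkout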

That said, I’d suggest avoiding a shared cache if possible and using a local cache on each dev machine, pushing to and pulling from the remote directly.
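
One way to do that (a sketch; the paths are examples) is to override the shared cache location in each machine's local, uncommitted config:

# .dvc/config.local overrides .dvc/config and is not committed to Git
# on a Windows dev box:
dvc config --local cache.dir D:\dvc-cache
# on a Linux runner:
dvc config --local cache.dir /var/cache/dvc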

With DVC, you can pull partial data and work with it. You can also add or update datasets without having to download everything.

See Modifying Large Datasets.
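
For example, with granular targets you can pull, modify, and push a single dataset without touching the rest of the 2 TB (the paths follow your layout; this sketch assumes ds1 is tracked by its own .dvc file):

# download and check out only one dataset
dvc pull db/datasets/ds1
# ...modify the data...
dvc add db/datasets/ds1
dvc push db/datasets/ds1
git add db/datasets/ds1.dvc
git commit -m "Update ds1"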

If you have more questions, please don’t hesitate to ask. :slightly_smiling_face:

One more thing:
We also have an enterprise tool for managing and versioning datasets at scale, that might suit your workflows. If you are interested, please send us an email at support@dvc.org.