Mounting GCS Bucket versioned with DVC on a VM

Hi everyone,

I’m in a situation that I believe many people must have come across, but unfortunately I can’t find any threads about it online.

I’m training neural networks on a remote Google Compute Engine VM. My data is stored in a GCS Bucket, and I’ve versioned the dataset using DVC and a dedicated git repo.

I want to access my data from my VM, but the dataset is huge: several TB. Therefore, I can’t use a standard “dvc pull” and have all my data copied locally. I need to stream it from the bucket.

I tried two options, both of which fail:

  1. I mounted the bucket on my VM using GCSFuse, but the data is “unreadable” because everything is stored under its hash. I need to know which files I’m processing (especially since I don’t want to process different versions of the same data).

  2. I used DVC’s Python API with my Git repo (dvc.api.DVCFileSystem(repo, revision)). Although this works, it takes ages as soon as I ask for tens of GB, so I don’t even want to try with the full dataset. A rough sketch of how I’m calling it is below.
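
(The repo URL, rev, and paths in this sketch are placeholders, not my real ones.)

```python
from dvc.api import DVCFileSystem

# Placeholder repo / rev / paths, for illustration only
fs = DVCFileSystem("https://github.com/<me>/<data-registry>", rev="main")

files = fs.find("data/train", detail=False)   # list the versioned files at this rev
with fs.open(files[0], mode="rb") as f:       # stream one file from the GCS remote
    blob = f.read()
```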

Right now, the only option I see is to keep track of the “latest” hash values by recomputing them at every dvc push, storing them in a file, and then using GCSFuse.

Has anyone ever been in a similar situation? Am I missing something in DVC’s Python API? I hope I was clear enough.

Thanks in advance!


Hi, @odex. Welcome to the forum.

Sorry for the late response. Could you please provide more details about your dataset (how many files, average file size, and total dataset size)? Also, how are you running your training?

And what kind of machines are you running this training on?

Also, could you please share how you are using DVCFileSystem? Are you using it directly or via some libraries/wrappers?

If you only need part of the data, you can pass a path filter to the command, e.g.:

dvc pull data/subset/ # only download data/subset files

Of course, if you need everything for training, this won’t work.

Could you share what APIs you used? Was it DVCFileSystem.open()? Did you also test downloading files using DVCFileSystem.get()?

DVCFileSystem is going to be slower in general because it always has to check for files in the cache. The DVCFileSystem.open() API is also slower because it is single-threaded. DVCFileSystem.get(), however, is multithreaded, but it’s still not as fast as using asyncio directly.
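
For clarity, this is the kind of difference I mean (the repo URL and paths are just placeholders):

```python
from dvc.api import DVCFileSystem

fs = DVCFileSystem("https://github.com/<org>/<repo>", rev="main")

# open(): streams a single file, single-threaded per file
with fs.open("data/train/sample_0001.npz", mode="rb") as f:
    payload = f.read()

# get(): downloads a whole directory, using multiple threads
fs.get("data/train", "/local/scratch/train", recursive=True)
```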

You can also avoid the filesystem layer entirely and resolve the remote location with dvc.api.get_url(), for example, and handle the downloads yourself.
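
For example, something along these lines, assuming your remote is the GCS bucket and you have gcsfs (or the google-cloud-storage client) available; the repo URL and paths are placeholders:

```python
import gcsfs
from dvc.api import get_url

repo = "https://github.com/<org>/<data-registry>"  # placeholder

# Resolve where DVC stored this file in the remote (content-addressed path)
url = get_url("data/train/shard_000.tar", repo=repo, rev="main")
# url looks something like gs://<bucket>/files/md5/ab/cdef...

# Read it directly from the bucket, bypassing DVCFileSystem entirely
gcs = gcsfs.GCSFileSystem()
with gcs.open(url, "rb") as f:
    chunk = f.read(1024 * 1024)
```

You could build a path → URL manifest like this once per dvc push and then stream the objects through gcsfs (or your GCSFuse mount) during training.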


I’d be interested to see what is slowing down DVCFileSystem. If you can provide a cProfile output, I’d be happy to take a look. That said, TBs of data are definitely at the limits of DVC.
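
If it helps, one way to capture such a profile (repo URL and paths are placeholders):

```python
import cProfile
from dvc.api import DVCFileSystem

fs = DVCFileSystem("https://github.com/<org>/<repo>", rev="main")

with cProfile.Profile() as pr:          # context-manager form needs Python 3.8+
    fs.get("data/subset", "subset", recursive=True)
pr.dump_stats("dvcfs.prof")             # attach this file, or inspect with `python -m pstats dvcfs.prof`
```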