Mounting a GCS bucket versioned with DVC on a VM

Hi everyone,

I’m in a situation that I believe many people must have come across, but unfortunately I can’t find any threads about it online.

I’m training neural networks on a remote Google Compute Engine VM. My data is stored in a GCS Bucket, and I’ve versioned the dataset using DVC and a dedicated git repo.

I want to access my data from the VM, but the dataset is huge (several TB), so I can’t do a standard “dvc pull” and have everything copied locally. I need to stream it from the bucket.

I tried two options, and both fall short:

  1. I mounted the bucket on my VM using gcsfuse, but the data is “unreadable”: everything is stored under its content hash. I need to know which data I’m processing (especially since I don’t want to process different versions of the same data).

  2. I used DVC’s Python API with my Git repo (dvc.api.DVCFileSystem(repo, rev=...); see the sketch after this list). It works, but it takes ages as soon as I request tens of GB, so I don’t even want to try it on the full dataset.
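For reference, this is roughly what option 2 looks like on my side (the repo URL, tag, and directory below are placeholders, not my real ones):

```python
import dvc.api

# Roughly my current setup; repo URL and tag are placeholders.
fs = dvc.api.DVCFileSystem(
    "https://github.com/you/your-data-repo",  # hypothetical dataset repo
    rev="v1.2.0",                              # hypothetical dataset version tag
)

# Enumerate the versioned files under a DVC-tracked directory...
paths = fs.find("data/train", detail=False)

# ...and stream each file straight from the GCS remote, without a full "dvc pull".
for path in paths[:10]:
    with fs.open(path, "rb") as f:
        sample = f.read()
```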

Right now, the only option I see is to keep track of the “latest” hash values by recomputing them at every dvc push, storing them in a file, and then using the gcsfuse mount to read the corresponding objects.
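A variant I’m considering, instead of recomputing the hashes myself, is to let DVC resolve them at read time: dvc.api.get_url() maps a logical path to its hashed object in the remote, and then I read that object through the gcsfuse mount. A minimal sketch, where the local repo clone, tag, bucket name and mount point are placeholders, and where I’m assuming the DVC 3.x “files/md5/…” remote layout and that get_url() resolves granular paths inside a tracked directory:

```python
import os
import dvc.api

REPO = "/home/me/data-repo"        # hypothetical local clone of the dataset repo
REV = "v1.2.0"                     # hypothetical dataset version tag
BUCKET = "my-dvc-remote-bucket"    # hypothetical GCS bucket backing the DVC remote
MOUNT_POINT = "/mnt/dvc-remote"    # where gcsfuse has the bucket mounted

def to_mounted_path(logical_path: str) -> str:
    """Map a human-readable dataset path to its hashed object in the mounted bucket."""
    # get_url() resolves the content hash from the repo metadata, so nothing
    # has to be recomputed or stored separately at push time.
    url = dvc.api.get_url(logical_path, repo=REPO, rev=REV)  # e.g. gs://<bucket>/files/md5/ab/cd...
    prefix = f"gs://{BUCKET}/"
    assert url.startswith(prefix), url
    return os.path.join(MOUNT_POINT, url[len(prefix):])

# Example: read one sample through the FUSE mount (streamed from GCS by gcsfuse).
with open(to_mounted_path("data/train/sample_000001.npy"), "rb") as f:
    payload = f.read()
```

I haven’t benchmarked this; with millions of files the per-file get_url() calls would probably need caching, or a single upfront listing via DVCFileSystem instead.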

Has anyone ever been in a similar situation? Am I missing something in DVC’s Python API? I hope I was clear enough.

Thanks in advance!