Mounting GCS Bucket versioned with DVC on a VM

Hi everyone,

I’m in a situation that I believe many people must have come across, but unfortunately I can’t find any threads about it online.

I’m training neural networks on a remote Google Compute Engine VM. My data is stored in a GCS Bucket, and I’ve versioned the dataset using DVC and a dedicated git repo.

I want to access my data from my VM, but the dataset is huge: several TB. Therefore, I can’t use a standard “dvc pull” and have all my data copied locally. I need to stream it from the bucket.

I tried two options, both of which fail:

  1. I mounted the bucket on my VM using GCSFuse, but the data is “unreadable” because everything is stored under its hash. I need to know which files I’m processing (especially since I don’t want to process different versions of the same data).

  2. I used DVC’s Python API with my Git repo (dvc.api.DVCFileSystem(repo, revision)). Although this works, it takes ages as soon as I ask for tens of GB, so I don’t even want to try with the full dataset. A rough sketch of how I’m calling it is below.
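
(The repo URL, rev, and paths in this sketch are placeholders, not my real ones.)

```python
from dvc.api import DVCFileSystem

# Placeholder repo / rev / paths, for illustration only
fs = DVCFileSystem("https://github.com/<me>/<data-registry>", rev="main")

files = fs.find("data/train", detail=False)   # list the versioned files at this rev
with fs.open(files[0], mode="rb") as f:       # stream one file from the GCS remote
    blob = f.read()
```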

Right now, the only option I see is to keep track of the “latest” hash values by recomputing them at every dvc push, storing them in a file, and then using GCSFuse.

Has anyone ever been in a similar situation? Am I missing something in DVC’s Python API? I hope I was clear enough.

Thanks in advance!


Hi, @odex. Welcome to the forum.

Sorry for the late response. Could you please provide more details about your dataset (how many files, average file size, and total dataset size)? Also, how are you running your training?

And what kind of machines are you running this training on?

Also, could you please share how you are using DVCFileSystem? Are you using it directly or via some libraries/wrappers?

If you only need part of the data, you can pass a path filter to the command, e.g.:

dvc pull data/subset/ # only download data/subset files

Of course, if you need everything for training, this won’t work.

Could you share what APIs you used? Was it DVCFileSystem.open()? Did you also test downloading files using DVCFileSystem.get()?

DVCFileSystem is going to be slower in general because it always has to check for files in the cache. The DVCFileSystem.open() API is also slower because it is single-threaded. DVCFileSystem.get(), however, is multithreaded, but it’s still not as fast as using asyncio directly.
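
For clarity, this is the kind of difference I mean (the repo URL and paths are just placeholders):

```python
from dvc.api import DVCFileSystem

fs = DVCFileSystem("https://github.com/<org>/<repo>", rev="main")

# open(): streams a single file, single-threaded per file
with fs.open("data/train/sample_0001.npz", mode="rb") as f:
    payload = f.read()

# get(): downloads a whole directory, using multiple threads
fs.get("data/train", "/local/scratch/train", recursive=True)
```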

You can also avoid the filesystem layer entirely and resolve the remote location with dvc.api.get_url(), for example, and handle the downloads yourself.
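
For example, something along these lines, assuming your remote is the GCS bucket and you have gcsfs (or the google-cloud-storage client) available; the repo URL and paths are placeholders:

```python
import gcsfs
from dvc.api import get_url

repo = "https://github.com/<org>/<data-registry>"  # placeholder

# Resolve where DVC stored this file in the remote (content-addressed path)
url = get_url("data/train/shard_000.tar", repo=repo, rev="main")
# url looks something like gs://<bucket>/files/md5/ab/cdef...

# Read it directly from the bucket, bypassing DVCFileSystem entirely
gcs = gcsfs.GCSFileSystem()
with gcs.open(url, "rb") as f:
    chunk = f.read(1024 * 1024)
```

You could build a path → URL manifest like this once per dvc push and then stream the objects through gcsfs (or your GCSFuse mount) during training.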


I’d be interested to see what is slowing down DVCFileSystem. If you can provide a cProfile output, I’d be happy to take a look. That said, TBs of data are definitely at the limits of DVC.
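
If it helps, one way to capture such a profile (repo URL and paths are placeholders):

```python
import cProfile
from dvc.api import DVCFileSystem

fs = DVCFileSystem("https://github.com/<org>/<repo>", rev="main")

with cProfile.Profile() as pr:          # context-manager form needs Python 3.8+
    fs.get("data/subset", "subset", recursive=True)
pr.dump_stats("dvcfs.prof")             # attach this file, or inspect with `python -m pstats dvcfs.prof`
```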