Hi everyone,
I’m in a situation that I believe many people must have come across, but unfortunately I can’t find any threads about it online.
I’m training neural networks on a remote Google Compute Engine VM. My data is stored in a GCS bucket, and I’ve versioned the dataset using DVC and a dedicated Git repo.
I want to access my data from the VM, but the dataset is huge (several TB), so I can’t run a standard `dvc pull` and have everything copied locally. I need to stream it from the bucket.
I tried two options, and both fail:
- I mounted the bucket on my VM using GCSFuse, but the data is “unreadable”: the remote stores everything under content hashes, and I need to know which data I’m processing (especially since I don’t want to process different versions of the same data).
- I used DVC’s Python API with my Git repo (`dvc.api.DVCFileSystem(repo, rev=revision)`). Although this works, it takes ages as soon as I ask for tens of GB, so I don’t even want to try it with the full dataset. A simplified sketch of what I’m doing is below.
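Roughly this (the repo URL, revision, and file paths are placeholders, not my real ones):

```python
import dvc.api

# Filesystem view of the dataset repo at a given revision
fs = dvc.api.DVCFileSystem(
    "https://github.com/me/my-dataset-repo",  # placeholder repo URL
    rev="main",                                # placeholder revision
)

# Copy a subset of the tracked data next to the training job
fs.get("data/train", "/tmp/train", recursive=True)

# Or stream a single file without materializing the whole directory
with fs.open("data/train/sample_000.npy", "rb") as f:
    blob = f.read()
```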
Right now, the only option I see is to keep track of the “latest” hash values myself: recompute the path-to-hash mapping at every `dvc push`, store it in a file, and then read the hashed objects through GCSFuse.
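In case it helps the discussion, this is roughly the direction I have in mind, using `dvc.api.get_url` to resolve each path to its hashed object and reading it through the GCSFuse mount. The repo URL, bucket name, and mount point are placeholders, and the exact object layout in the remote depends on the DVC version:

```python
import dvc.api

REPO = "https://github.com/me/my-dataset-repo"  # placeholder repo URL
REV = "main"                                     # placeholder revision
REMOTE_PREFIX = "gs://my-dvc-remote"             # placeholder DVC remote bucket
MOUNT_POINT = "/mnt/dvc-remote"                  # where GCSFuse mounts that bucket

def resolve_mounted_path(dataset_path: str) -> str:
    """Map a dataset path at REV to its hashed object on the GCSFuse mount."""
    # get_url returns the object's location in the DVC remote,
    # e.g. gs://my-dvc-remote/files/md5/ab/cdef... (layout depends on DVC version)
    url = dvc.api.get_url(dataset_path, repo=REPO, rev=REV)
    # Rewrite the gs:// URL into the path exposed by the GCSFuse mount
    return url.replace(REMOTE_PREFIX, MOUNT_POINT)

# Example: stream one file from the mount without pulling the dataset
with open(resolve_mounted_path("data/train/sample_000.npy"), "rb") as f:
    blob = f.read()
```

That said, this resolves paths one at a time, so it still feels like working around the tool rather than with it.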
Has anyone ever been in a similar situation? Am I missing something in DVC’s Python API? I hope I was clear enough.
Thanks in advance!