Running out of storage when streaming with DVCFileSystem

I have the following setup:

  • A Git/DVC repo on GitLab, with a Google Cloud Storage bucket as its remote, used as a data registry

  • Another Git repo with the code for training a machine learning model

I am able to load files from my training data using DVCFileSystem.open(). However, the dataset is >1 TB, so after training has been running for a while my Docker container runs out of storage space (100 GB).
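For reference, the streaming looks roughly like this (the repo URL, rev, and file path below are placeholders):

```python
from dvc.api import DVCFileSystem

# Placeholder repo URL and rev; the real registry lives on GitLab with a
# Google Cloud Storage remote.
fs = DVCFileSystem("https://gitlab.com/myorg/data-registry.git", rev="main")

# Placeholder path inside the registry; data is streamed from the remote,
# not from a local checkout.
with fs.open("data/train/shard-0001.bin", mode="rb") as f:
    payload = f.read()
```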

I have identified a file, /var/tmp/dvc/repo/a21a7b1a105c2c3634b082bc0dc5d0b9/index/data/db.db, that seems to keep growing. Is there a way to limit the size of this file?

I’ve found a workaround, although it isn’t a great one. If I set the environment variable DVC_SITE_CACHE_DIR, the db file gets written there instead. So I pointed it at a directory in cloud storage, which is effectively unlimited. But now I have to go and clean that directory up after each training job, unless I want to pay to store a bunch of stale caches.
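Roughly what the workaround looks like; the mount point and job directory are placeholders, and this assumes the bucket is mounted as a local path (e.g. via gcsfuse) so DVC can write the db file to it:

```python
import os
import shutil

from dvc.api import DVCFileSystem

# Hypothetical mount point for the GCS bucket, so the site cache db lands
# in the bucket instead of /var/tmp/dvc.
SITE_CACHE = "/mnt/gcs-bucket/dvc-site-cache/job-1234"
os.environ["DVC_SITE_CACHE_DIR"] = SITE_CACHE  # set before the filesystem is created

fs = DVCFileSystem("https://gitlab.com/myorg/data-registry.git", rev="main")
try:
    ...  # training loop, streaming files with fs.open()
finally:
    # The cleanup step I'd rather not need: remove this job's cache dir
    # so old caches don't pile up in the bucket.
    shutil.rmtree(SITE_CACHE, ignore_errors=True)
```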