I have the following setup:
- A git/dvc repo on GitLab, with a Google Cloud Storage bucket as the remote, used as a data registry
- Another git repo with code for training a machine learning model
I am able to load files from my training data using DVCFileSystem.open(). However, the dataset is >1 TB, so after training has run for a while my Docker container runs out of its 100 GB of storage.
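For context, this is roughly how I read the data (the registry URL, revision, and file path are placeholders for my actual setup):

```python
from dvc.api import DVCFileSystem

# Point the filesystem at the data registry repo (placeholder URL/rev)
fs = DVCFileSystem("https://gitlab.com/myorg/data-registry.git", rev="main")

# Stream individual training files from the GCS remote; each open()
# goes through DVC's index/cache layer under /var/tmp/dvc/repo/...
with fs.open("data/train/sample_000001.bin", mode="rb") as f:
    payload = f.read()
```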
I have traced the growth to a single file, /var/tmp/dvc/repo/a21a7b1a105c2c3634b082bc0dc5d0b9/index/data/db.db, which keeps growing as training progresses. Is there a way to limit the size of this file?