And to be very specific on the storage part, the scenario runs in Azure on AKS, with Blob Storage and Azure File Shares. I absolutely need the cache because media files are stored in different storage tiers (potentially even in the archive tier).
As usual, when we run training on large datasets on GPU, slow storage makes GPU nodes waste cycles waiting for dataset files to download. Instead, we run a pre-caching pipeline on a CPU node, copying the dataset files to a premium (SSD) storage account, either Blob Storage or Azure File Shares mounted via NFS or SMB.
A shared DVC cache is needed to avoid file duplication when training different models simultaneously (because fast storage is expensive).
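For reference, a minimal sketch of how such a shared cache can be set up. The mount path is hypothetical; it assumes the premium file share is mounted at the same path on every node:

```bash
# Point DVC at a cache directory on the shared premium mount (hypothetical path)
dvc cache dir /mnt/premium-share/dvc-cache

# Relax permissions so multiple jobs/users can reuse the same cache
dvc config cache.shared group

# Link files out of the cache instead of copying them into each workspace
dvc config cache.type symlink
```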
Even though caching is a relatively rare operation, no matter how fast the node is, pulling a large dataset from the long-term storage into the node’s memory and then into the cache takes a long time, because it is bounded by the node’s network throughput. Using azcopy between two storage accounts directly makes the copy a server-side operation distributed across multiple Azure Storage nodes, and therefore many times faster.
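As an illustration of that server-to-server copy (account names, container paths, and SAS tokens below are placeholders):

```bash
# Server-side copy from the long-term account to the premium account;
# the data moves between storage accounts without passing through the node running azcopy.
azcopy copy \
  "https://longtermaccount.blob.core.windows.net/datasets/my-dataset?<source-SAS>" \
  "https://premiumaccount.blob.core.windows.net/cache/my-dataset?<dest-SAS>" \
  --recursive
```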