And to be very specific on the storage part, the scenario runs in Azure on AKS, with Blob Storage and Azure File Shares. I absolutely need the cache because media files are stored in different storage tiers (potentially even in the archive tier).
As usual, when we run training on large datasets on GPU, slow storage makes GPU nodes waste cycles waiting for dataset files to download. Instead, we run a pre-caching pipeline on a CPU node, copying the dataset files to a premium (SSD) storage account, either Blob Storage or Azure File Shares mounted via NFS or SMB.
A shared DVC cache is needed to avoid file duplication when training different models simultaneously (because fast storage is expensive).
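For reference, a minimal sketch of how such a shared cache can be set up. The mount path is hypothetical; it assumes the premium file share is mounted at the same path on every node:

```bash
# Point DVC at a cache directory on the shared premium mount (hypothetical path)
dvc cache dir /mnt/premium-share/dvc-cache

# Relax permissions so multiple jobs/users can reuse the same cache
dvc config cache.shared group

# Link files out of the cache instead of copying them into each workspace
dvc config cache.type symlink
```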
Even though caching is a relatively rare operation, no matter how fast the node is, pulling a large dataset from the long-term storage into the node’s memory and then into the cache takes a long time, because it is bounded by the node’s network throughput. Using azcopy between two storage accounts directly makes the copy a server-side operation distributed across multiple Azure Storage nodes, and therefore many times faster.
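As an illustration of that server-to-server copy (account names, container paths, and SAS tokens below are placeholders):

```bash
# Server-side copy from the long-term account to the premium account;
# the data moves between storage accounts without passing through the node running azcopy.
azcopy copy \
  "https://longtermaccount.blob.core.windows.net/datasets/my-dataset?<source-SAS>" \
  "https://premiumaccount.blob.core.windows.net/cache/my-dataset?<dest-SAS>" \
  --recursive
```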