I’m trying to get my head around external dependencies, outputs, and the cache. It’s not quite clear whether I can achieve what I need with DVC, even though it seems very close. Here is what I’m trying to get.
I need to implement the Shared Development Server scenario while minimizing download/upload traffic between the cache and external datasets. Here is why and how:
- I have a fast cloud storage mounted to a Kubernetes cluster via NFS or SMB, and I want to use it as a shared local cache. Think of a mounted AWS S3 bucket or an Azure File Share.
- I have a large dataset stored in the same or another bucket/account, or potentially elsewhere, but always with the ability to copy files to/from my cache storage many times faster than copying them through the machine where the DVC command runs.
- When I execute dvc run, the command can initiate a direct copy between the cache storage bucket/share and the dataset bucket/account.
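For context, here is roughly how I set up the shared cache today. The commands are standard DVC cache configuration, but the mount path `/mnt/shared-cache` is just an example of where the bucket/share lands on the cluster:

```shell
# Point DVC's cache at the mounted bucket/share
dvc cache dir /mnt/shared-cache

# Prefer links over copies so workspace files don't duplicate cache bytes
dvc config cache.type "reflink,symlink,hardlink,copy"

# Let multiple users on the cluster share the same cache
dvc config cache.shared group
```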
Basically, what I want is for the commands from these examples to run in the context of the shared cache folder, and then for DVC to create a hard/sym/ref-link to the file in the shared cache, so the actual dataset bytes never reach the cluster until the pipeline code actually reads the files. In the AWS S3 example, it would be similar to running
aws s3 cp s3://my_dataset_bucket/data.txt s3://my_cache_bucket/cache_path/data.txt
Let’s assume the command can actually recognize that the download destination is in the mounted cloud storage, not an ordinary local directory. The same goes for the opposite direction.
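To make it concrete, here is a sketch of the kind of stage I have in mind, assuming the external-dependency syntax from the DVC docs (the bucket names and output path are the made-up ones from above):

```shell
# Hypothetical stage: data.txt is an external dependency living in S3.
# What I'd like: since the cache dir is itself a mounted bucket, this copy
# would happen bucket-to-bucket, never through the machine running dvc.
dvc run -d s3://my_dataset_bucket/data.txt \
        -o data.txt \
        aws s3 cp s3://my_dataset_bucket/data.txt data.txt
```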
As far as I can tell right now, this command
aws s3 cp s3://mybucket/data.txt data.txt
downloads data.txt somewhere locally (where, exactly?), then moves it into the cache, and makes a link in the workspace.