Direct copy between Shared Cache and External Dependencies/Outputs

I’m trying to get my head around external dependencies, outputs and cache. It’s not quite clear if I can achieve what I need with DVC, even though it seems to be very close. Here is what I’m trying to get.

I need to implement the Shared Development Server scenario while optimizing download/upload operations between the cache and external datasets. Here is why and how:

  1. I have a fast cloud storage mounted to a Kubernetes cluster via NFS or SMB, and I want to use it as Shared Local Cache. Think about mounted AWS S3 bucket or Azure File Share.
  2. I have a large dataset that is stored in the same or another bucket/account, or potentially elsewhere, but always with an ability to copy files to/from my cache storage many times faster than if I was copying them via the machine where the DVC command runs.
  3. When I execute dvc run, the command can initiate direct copy between the cache storage bucket/share and the dataset bucket/account.
    Basically, what I want is for the commands from these examples to run in the context of the shared cache folder, and then for DVC to create a hard/sym/ref-link to the file in the shared cache, so the actual dataset bytes never reach the cluster until the pipeline code actually reads the files. In the example with AWS S3, it would be similar to running
aws s3 cp s3://my_dataset_bucket/data.txt s3://my_cache_bucket/cache_path/data.txt

Let’s assume the command can actually recognize that the download destination is in the mounted cloud storage, not a usual directory. Same for the opposite direction.
The way it looks to me now, this command

aws s3 cp s3://mybucket/data.txt data.txt

downloads data.txt somewhere locally (where?), then moves it to the cache, and makes a link in the workspace.
Thank you!

Hey @VladKol sorry for the delay.

I’m not sure this is what you need. You seem to want to share a DVC cache among people working from different machines, not by logging into a single shared server. Please confirm (so I know I’m on the right track here).

The part about configuring an (external) shared cache is appropriate though:

OK, so far this is straightforward to set up:

$ dvc cache dir /mnt/nfs-drive
$ dvc config cache.shared group

It’s not clear to me why you want to move this dataset in the first place, especially if it’s not in the same bucket already (that would cause data duplication). Have you explored the possibility of adding the data externally (as an external output)? See https://dvc.org/doc/user-guide/managing-external-data. Notice that no link or copy is created in the workspace for external outputs though (thus “external”).

If what you need is to process the dataset, then you would indeed need it locally, so a download is necessary anyway, right? Or possibly it can be processed without ever downloading the full file/dir, with code that integrates with the cloud storage (e.g. boto3 for S3/Python) and reads only the data that it needs.

Please clarify where I haven’t understood you correctly so I can give you better ideas. Thanks!

Just to add to this: you most likely want to enable symlinks in the shared cache scenario (the cache.type config option, alongside cache.dir, etc.)

@jorgeorpinel @shcheklein you are right, it’s not quite a Shared Development Server. I would like to have a file share that is used as dvc cache for multiple “users” - to avoid downloading duplicates of large files. So that would be external cache then, and files in the local repo would be symlinks to the cache. DVC “doesn’t know” that the cache file share is actually a mounted S3 bucket. But my training code does.
Second part is a bit more complicated.
I have a huge dataset: hundreds of thousands of images and video files, some of them are multiple gigabytes.
First thing they create is a Data Registry repo where media files are added individually. I don’t need to track changes in the files themselves - they rarely change. I need to track what files have been added or removed from the repo. Think about this repo as a catalog of all files they can possibly label.
What is better to use for that? dvc add --external or dvc import-url?
Next, other people start labeling these files - they should be able to clone the repo, pull individual files (not the whole multi-terabyte dataset) and create .csv-files with labeling data, in the same repo, stored and tracked pretty much by Git.
The next step is creating individual dataset repos. For example, I want to create an object detection model for apples and bananas. I create an empty repo, then use dvc import to add pictures and videos of apples and bananas from the Data Registry, based on their labeling. These files may have other objects labeled as well - that’s fine. I also randomly add pictures without apples or bananas - you know, for making the dataset balanced. Ideally, I would like to do that without downloading the files (--no-exec?) because this stage is entirely based on labels.
Next phase is about experiments. I create branches of the dataset repo, add additional labeling, remove some media files, import some more, etc.
Every time I train a model, I do it on a Kubernetes cluster in the cloud, and it uses an S3 bucket mounted as a shared dvc cache. At this point, I clone a dataset repo (branch/commit) entirely, and I need all media files to be available. At the same time, the same cluster may be used for training a different model with a different dataset, and some of the media files may be the same. I’d like to avoid re-downloading those, that’s why the shared cache is important.
As we remember, all our media files are in different S3 buckets, so when I pull the repo, I’d rather copy media files directly from the storage bucket to the cache bucket, without downloading to the cluster’s nodes memory.
Right now I see the last step being impossible with dvc. I’m thinking about skipping the dvc cache entirely, just git-cloning the repo, enumerating and analyzing .dvc files myself programmatically, copying media files to a shared cache that I maintain myself.
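
The do-it-yourself route described above could start from something like the following. This is only a minimal sketch under stated assumptions: it assumes each .dvc file lists its outputs under a top-level outs: key with plain, unquoted md5: and path: fields, and the helper names read_outs and list_tracked are made up for illustration. A real YAML parser would be more robust than this line-based reader.

```python
import re
from pathlib import Path

# Matches "md5: <hash>" or "path: <path>", with or without a leading "- " list marker.
_FIELD = re.compile(r"^\s*(?:-\s+)?(md5|path):\s*(\S.*?)\s*$")


def read_outs(dvc_text):
    """Return [(md5, path), ...] for the entries under the top-level 'outs:' key.

    Hypothetical helper: handles only the simple, unquoted layout described above.
    """
    entries, current, in_outs = [], {}, False
    for line in dvc_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        # A top-level key starts in column 0 and is not a list item.
        if not line[0].isspace() and not stripped.startswith("-"):
            in_outs = stripped == "outs:"
            continue
        if not in_outs:
            continue
        # A new list item closes the previous output entry.
        if stripped.startswith("-") and current:
            entries.append(current)
            current = {}
        match = _FIELD.match(line)
        if match:
            current[match.group(1)] = match.group(2)
    if current:
        entries.append(current)
    return [(e["md5"], e["path"]) for e in entries if "md5" in e and "path" in e]


def list_tracked(repo_root):
    """Map every .dvc file under repo_root to the outputs it records."""
    return {
        str(dvc_file.relative_to(repo_root)): read_outs(dvc_file.read_text())
        for dvc_file in sorted(Path(repo_root).rglob("*.dvc"))
        if dvc_file.is_file()  # skips the .dvc/ directory, which also matches the pattern
    }
```

The resulting (md5, path) pairs are what a custom copy step between the storage bucket and the cache share would need. Outputs that are whole directories are not expanded by this sketch.
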
I wonder if there is a better way. If not, do you think it’s a valid scenario for dvc in the future?
Thank you!

And to be very specific on the storage part, the scenario is running in Azure on AKS, Blob Storage and Azure File Shares. I absolutely need the cache because media files are stored in different storage tiers (potentially even in the archive tier).
As usual, when we run training on large datasets on GPU, slow storage makes GPU nodes waste cycles waiting for dataset files to download. Instead, we run a pre-caching pipeline on a CPU node, copying the dataset files to a premium storage account (SSD) in Blob storage or Azure File Shares mounted via NFS or SMB.
Shared dvc cache is needed to avoid file duplication when training different models simultaneously (because fast storage is expensive).
Despite caching being a relatively rare operation, on a large dataset it will take a long time to pull files from the long-term storage to the node’s memory and then to the cache, no matter how fast the node is, because of the node’s network throughput. Using az copy between 2 storage accounts directly distributes the operations across multiple Azure storage nodes and is therefore many times faster.

So the only way to put data in the cache is to have it locally in the workspace first, and then track it with DVC (dvc add/commit/run/repro). This moves it to the cache and links it back to the workspace.

Again, the exception is when tracking existing data in a supported remote location a.k.a. “external outputs” (the external cache should be configured in the same remote location first), in which case the data is never downloaded (nor linked) to the local machine.

Also, the DVC cache has a special structure. All this means unfortunately there’s no good way to put the files in the cache first and then link them to your local workspace. I mean, it’s doable, but it would be a manual hack.
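
To give an idea of what such a manual hack would have to reproduce, here is a rough sketch. It assumes the content-addressed layout used by DVC releases of this period, where a file lands at <cache dir>/<first two hex characters of its MD5>/<remaining characters>; newer releases nest this under files/md5, which the optional layout_prefix argument stands in for. The function names are hypothetical, and plain MD5 of the raw bytes is itself an assumption that is safest for binary media files (older releases may normalize line endings of text files before hashing).

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so multi-gigabyte media never sits in memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path_for(cache_dir, md5_hex, layout_prefix=""):
    """Build the assumed cache location for a file with the given MD5.

    layout_prefix is an illustrative knob: "" for the older flat layout,
    "files/md5" for the nested layout of newer releases.
    """
    return Path(cache_dir) / layout_prefix / md5_hex[:2] / md5_hex[2:]
```

For example, cache_path_for("/mnt/nfs-drive", file_md5("data.txt")) gives the path a direct bucket-to-share copy would have to write to. DVC would still need a matching .dvc file before it could link that object into a workspace.
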

Also I’m not sure you can link files in a virtual network mount of an Azure bucket into physical drives, or if that prevents downloads when the data needs to be actually accessed. I’d rely on you for confirmation on that. That would be an interesting proposition to explore for sure though, although it still seems overcomplicated…

@VladKol is adding the data externally not enough? Your code can still access it in its mounted location directly. And DVC can still detect when it changes if it is used as an external dependency, thus reproducing the needed stages downstream.

@VladKol what you describe is a pretty common workflow for managing CV datasets. There is no special support in DVC for that yet (though we have been extensively discussing how to make that happen). I see why you might be interested in using DVC though: caching (and especially cache + symlinks) is a very powerful mechanism! What DVC lacks is a way to “slice” these datasets granularly based on their labels.

Before we jump into some possible workarounds or ideas besides DVC, let me ask you a few questions:

they should be able to clone the repo, pull individual files

How do you see them determining which file to pull?

and create .csv-files with labeling data

Do you have an example of a possible structure for this file? Is it something like file-name, class? Do you have multiple labels per image?

then use dvc import to add pictures and videos of apples and bananas from the Data Registry, based on their labeling.

That’s the biggest problem to my mind. Unless you decide to store all images that belong to a certain class within a separate directory, it would be hard to achieve this with the current DVC.

As we remember, all our media files are in different S3 buckets, so when I pull the repo, I’d rather copy media files directly from the storage bucket to the cache bucket, without downloading to the cluster’s nodes memory.

Are you okay with duplicating data on S3? Also, I’m lost here with the concept of the Data Registry repo where media files are added individually. Is the data registry based on the S3 cache, or on all those buckets with different lifecycle policies?

I’ll excuse myself from the more elaborate part involving your full workflow but here are a few last specific answers from me in case they help:

This can be done by adding the entire directories in question e.g. dvc add media/images. See Data Registries. I don’t think this will work with --external outputs though.

Note that import-url is not useful for registries, as chained imports are not supported.

All DVC commands that accept targets support granularity e.g. dvc pull media/images/apples/12345.jpg.
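
Combined with label files kept in Git, that granularity could be scripted. The sketch below is only an illustration under assumptions: the CSV layout (one row per label, with columns named file and label) is hypothetical since the real structure was not given in this thread, and the helpers targets_for_labels and pull_command are invented names. It only builds the dvc pull argument list and does not run it.

```python
import csv
import io


def targets_for_labels(csv_text, wanted_labels):
    """Return the unique files, in first-seen order, that carry any wanted label.

    Assumes hypothetical columns 'file' and 'label', one row per (file, label) pair.
    """
    wanted = set(wanted_labels)
    seen, targets = set(), []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["label"] in wanted and row["file"] not in seen:
            seen.add(row["file"])
            targets.append(row["file"])
    return targets


def pull_command(targets):
    """Build the argument list for a granular pull of the selected files."""
    return ["dvc", "pull", *targets]
```

The returned list can be handed to subprocess.run once the DVC remote is configured. For very large selections the targets would likely need to be split into batches to stay within command-line length limits.
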

dvc import always downloads the data and has no --no-exec option at the moment (see https://github.com/iterative/dvc/issues/4567 though). If all you need are the labels, you can import those CSV files directly (DVC supports importing files tracked by Git).

Hacky: You could just copy the corresponding .dvc files from the registry to your project, and set up the same DVC remote the registry uses for this project, so the files can be dvc pulled later as needed.

Thanks
