Direct copy between Shared Cache and External Dependencies/Outputs

I’m trying to get my head around external dependencies, outputs and the cache. It’s not quite clear whether I can achieve what I need with DVC, even though it seems to be very close. Here is what I’m trying to achieve.

I need to implement the Shared Development Server scenario while optimizing download/upload operations between the cache and external datasets. Here is why and how:

  1. I have fast cloud storage mounted to a Kubernetes cluster via NFS or SMB, and I want to use it as a Shared Local Cache. Think of a mounted AWS S3 bucket or an Azure File Share.
  2. I have a large dataset that is stored in the same or another bucket/account, or potentially elsewhere, but always with the ability to copy files to/from my cache storage many times faster than if I were copying them via the machine where the DVC command runs.
  3. When I execute dvc run, the command should be able to initiate a direct copy between the cache storage bucket/share and the dataset bucket/account.
    Basically, what I want is for the commands from these examples to run in the context of the shared cache folder, and then for dvc to create a hard/sym/ref-link to the file in the shared cache, so the actual dataset bytes never make it to the cluster until the pipeline code actually reads the files. In the example with AWS S3, it would be similar to running
aws s3 cp s3://my_dataset_bucket/data.txt s3://my_cache_bucket/cache_path/data.txt

Let’s assume the command can actually recognize that the download destination is in the mounted cloud storage, not a regular local directory. The same goes for the opposite direction.
The way it looks to me now, this command

aws s3 cp s3://mybucket/data.txt data.txt

downloads data.txt somewhere locally (where?), then moves it to the cache, and makes a link in the workspace.
Thank you!

Hey @VladKol sorry for the delay.

I’m not sure this is what you need. You seem to want to share a DVC cache among people working from different machines, not by logging into a single shared server. Please confirm (so I know I’m on the right track here).

The part about configuring an (external) shared cache is appropriate though:

OK, so far this is straightforward to set up:

$ dvc cache dir /mnt/nfs-drive
$ dvc config cache.shared group

It’s not clear to me why you want to move this dataset in the first place, especially if it’s not in the same bucket already (that would cause data duplication). Have you explored the possibility of adding the data externally (as an external output)? See External Data. Notice that no link or copy is created in the workspace for external outputs though (thus “external”).
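For reference, tracking data in place like that might look roughly like the sketch below; the bucket name and paths are made up, and the external cache has to be configured in the same remote location as the data:

$ dvc remote add s3cache s3://my_dataset_bucket/cache       # external cache location, same bucket as the data
$ dvc config cache.s3 s3cache                               # tell DVC to use it as the S3 cache
$ dvc add --external s3://my_dataset_bucket/existing-data   # track the data where it already lives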

If what you need is to process the dataset then you would need it locally indeed, so a download is necessary anyway, right? Or possibly it can be processed without ever downloading the full file/dir, with code that integrates with the cloud storage (e.g. boto3 for S3/Python) and reads only the data that it needs.
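For instance, the same idea with the AWS CLI instead of boto3 (the bucket and key names below are hypothetical) would be fetching only a byte range of an object rather than the whole file:

$ aws s3api get-object --bucket my_dataset_bucket --key data.txt \
    --range bytes=0-1048575 data_first_1mb.bin   # reads only the first 1 MiB of the object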

Please clarify where you see that I haven’t understood you correctly to give you better ideas. Thanks!


Just to add to this: you most likely want to enable symlinks in the shared cache scenario (the cache.type config option, etc.).
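For example (option names may vary slightly by DVC version):

$ dvc config cache.type symlink   # link workspace files into the shared cache instead of copying
$ dvc checkout --relink           # re-link existing workspace files using the new cache.type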


@jorgeorpinel @shcheklein you are right, it’s not quite a Shared Development Server. I would like to have a file share that is used as the dvc cache for multiple “users”, to avoid downloading duplicates of large files. So that would be an external cache then, and files in the local repo would be symlinks into the cache. DVC “doesn’t know” that the cache file share is actually a mounted S3 bucket. But my training code does.
The second part is a bit more complicated.
I have a huge dataset: hundreds of thousands of images and video files, some of them are multiple gigabytes.
The first thing they create is a Data Registry repo where media files are added individually. I don’t need to track changes in the files themselves - they rarely change. I need to track which files have been added to or removed from the repo. Think of this repo as a catalog of all the files they can possibly label.
Which is better to use for that: dvc add --external or dvc import-url?
Next, other people start labeling these files - they should be able to clone the repo, pull individual files (not the whole multi-terabyte dataset) and create .csv-files with labeling data, in the same repo, stored and tracked pretty much by Git.
The next step is creating individual dataset repos. For example, I want to create an object detection model for apples and bananas. I create an empty repo, then use dvc import to add pictures and videos of apples and bananas from the Data Registry, based on their labeling. These files may have other objects labeled as well - that’s fine. I also randomly add pictures without apples or bananas - you know, to keep the dataset balanced. Ideally, I would like to do that without downloading the files (--no-exec?) because this stage is entirely based on labels.
Next phase is about experiments. I create branches of the dataset repo, add additional labeling, remove some media files, import some more, etc.
Every time I train a model, I do it on a Kubernetes cluster in the cloud, and it uses an S3 bucket mounted as a shared dvc cache. At this point, I clone a dataset repo (branch/commit) entirely, and I need all media files to be available. At the same time, the same cluster may be used for training a different model with a different dataset, and some of the media files may be the same. I’d like to avoid re-downloading those, that’s why the shared cache is important.
As we remember, all our media files are in different S3 buckets, so when I pull the repo, I’d rather copy media files directly from the storage bucket to the cache bucket, without downloading them to the cluster nodes’ memory.
Right now I see the last step as impossible with dvc. I’m thinking about skipping the dvc cache entirely: just git-cloning the repo, enumerating and analyzing the .dvc files myself programmatically, and copying media files to a shared cache that I maintain myself.
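Roughly, I imagine something like the sketch below (bucket names are made up; it assumes each .dvc file tracks a single file whose path in the dataset bucket mirrors the workspace path, and the usual two-character-prefix cache layout):

# enumerate .dvc files and mirror their outputs into the self-managed cache bucket
for f in $(git ls-files '*.dvc'); do
  md5=$(grep -m1 'md5:' "$f" | awk '{print $NF}')
  path=$(grep -m1 'path:' "$f" | awk '{print $NF}')
  aws s3 cp "s3://my_dataset_bucket/$path" "s3://my_cache_bucket/cache_path/${md5:0:2}/${md5:2}"
done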
I wonder if there is a better way. If not, do you think it’s a valid scenario for dvc in the future?
Thank you!


And to be very specific about the storage part: the scenario runs in Azure on AKS, Blob Storage and Azure File Shares. I absolutely need the cache because media files are stored in different storage tiers (potentially even in the archive tier).
As usual, when we run training on large datasets on GPUs, slow storage makes the GPU nodes waste cycles waiting for dataset files to download. Instead, we run a pre-caching pipeline on a CPU node, copying the dataset files to a premium (SSD) storage account, either Blob Storage or Azure File Shares mounted via NFS or SMB.
A shared dvc cache is needed to avoid file duplication when training different models simultaneously (because fast storage is expensive).
Even though caching is a relatively rare operation, no matter how fast the node is, on a large dataset it will take a long time to pull files from long-term storage through the node’s memory and into the cache, because of the node’s network throughput. Using azcopy directly between two storage accounts distributes the operation across multiple Azure storage nodes and is therefore many times faster.
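For example, a server-side copy of a whole dataset prefix between two accounts (the account names, container names and SAS tokens below are just placeholders):

azcopy copy "https://archivestore.blob.core.windows.net/datasets/apples?<SAS>" \
            "https://premiumstore.blob.core.windows.net/dvc-cache?<SAS>" --recursive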


So the only way to put data in the cache is to have it locally in the workspace first, and then track it with DVC (dvc add/commit/run/repro). This moves it to the cache and links it back to the workspace.
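E.g. (just to illustrate the normal flow; paths are made up):

$ dvc add data/images   # hashes the data and moves it into the cache
$ ls -l data/images     # workspace entries are now links (or copies) into the cache, per cache.type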

Again, the exception is when tracking existing data in a supported remote location a.k.a. “external outputs” (the external cache should be configured in the same remote location first), in which case the data is never downloaded (nor linked) to the local machine.

Also, the DVC cache has a special structure. All this means that, unfortunately, there’s no good way to put the files in the cache first and then link them to your local workspace. I mean, it’s doable, but it would be a manual hack.

Also, I’m not sure you can link files from a network mount of Azure storage onto physical drives, or whether that prevents downloads when the data actually needs to be accessed. I’d rely on you for confirmation on that. It would be an interesting proposition to explore for sure, although it still seems overcomplicated…

@VladKol is adding the data externally not enough? Your code can still access it in its mounted location directly. And DVC can still detect when it changes if it’s used as an external dependency, thus reproducing the needed stages downstream.

@VladKol what you describe is a pretty common workflow for managing CV datasets. There is no special support in DVC for that yet (though we have been extensively discussing how to make that happen). I see why you would be interested in using DVC though - caching (and especially cache + symlinks) is a very powerful mechanism! What DVC lacks is a way to “slice” these datasets granularly based on their labels.

Before we jump into some possible workarounds or ideas besides DVC, let me ask you a few questions:

they should be able to clone the repo, pull individual files

How do you see them determining which files to pull?

and create .csv-files with labeling data

Do you have an example of a possible structure for this file? Is it something like file-name, class? Do you have multiple labels per image?

then use dvc import to add pictures and videos of apples and bananas from the Data Registry, based on their labeling.

That’s the biggest problem to my mind. Unless you decide to store all images that belong to a certain class within a separate directory, it would be hard to achieve this with the current DVC.

As we remember, all our media files are in different S3 buckets, so when I pull the repo, I’d rather copy media files directly from the storage bucket to the cache bucket, without downloading to the cluster’s nodes memory.

Are you okay with duplicating data on S3? Also, I’m lost here on the “Data Registry repo where media files are added individually” concept. Is the data registry based on the S3 cache, or on all those buckets with different lifecycle policies?


I’ll excuse myself from the more elaborate part involving your full workflow but here are a few last specific answers from me in case they help:

This can be done by adding the entire directories in question e.g. dvc add media/images. See Data Registries. I don’t think this will work with --external outputs though.

Note that import-url is not useful for registries, as chained imports are not supported.

All DVC commands that accept targets support granularity e.g. dvc pull media/images/apples/12345.jpg.

dvc import always downloads the data and has no --no-exec option at the moment (see import: support --no-exec · Issue #4567 · iterative/dvc · GitHub though). If all you need are the labels, you can import those CSV files directly (DVC supports importing files tracked by Git).
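For example (the repo URL and file path here are made up):

$ dvc import https://github.com/example/data-registry labels/apples.csv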

Hacky: you could just copy the corresponding .dvc files from the registry to your project, and set up the same DVC remote the registry uses for this project, so the files can be dvc pulled later when needed.
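Roughly (again, the names and remote URL below are hypothetical):

$ cp ../data-registry/media/images.dvc .                     # bring over the .dvc file from the registry
$ dvc remote add -d registry-storage azure://datastore/path  # same remote the registry pushes to
$ dvc pull images.dvc                                        # fetch the actual files when needed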

Thanks


Hello everyone,

I am sorry to bump this topic, but I think it greatly fits the use case that my coworker and I have been trying to implement for days, so I believe it is better to have a unified topic on the subject. More specifically, I believe that DVC could benefit from a system dedicated to experiments on large unstructured datasets, which are defined by the fact that the training and test datasets are only modified by the addition and removal of files, rather than by the modification of a limited number of single text files (a csv for instance).

The use case is as follows: we have an S3 remote storage, and we want to be able to track a dataset composed of roughly 20,000 images (in other situations, the number of images would be far greater). The dataset is currently stored on S3.

We would like to be able to follow this protocol:

  • create a tag or a branch once the initial data import is finished;
  • try a first experiment;
  • remove or add images before creating a new tag or branch;
  • try a new experiment with the new set of images.

The data would be in a data registry while the experiments are conducted from the workspace of a project repo. Ideally, DVC would quickly restore the images or add the new ones between experiments.

While reading this topic, something clicked about our use case: the Catalog concept. We tried to do the exact same thing as @VladKol wanted to implement, i.e. adding or removing symbolic links rather than the images themselves, and I understand that it is impossible due to the nature of DVC files.

However, what would be perfect for this case would be a dedicated object (a Catalog object, maybe?) tracked by dvc. The idea is that this folder would be populated by largely immutable files (deletions would be possible but infrequent, mainly for garbage collection), and there would be commands to create dvc files which link to specific images in the catalog (like a Catalog Subset) without any data duplication. Data scientists could use a command to modify the Catalog Subset object by removing or adding images contained in the Catalog: each version of the Subset could be saved as a DVC file, which would enable quick changes to the train and test datasets between experiments, for instance by reverting to an older Subset containing fewer or more images. This would reconstruct within the workspace the direct links to the selected images in the subset. The benefits would be the absence of data duplication, by streaming files directly from S3, while maintaining versioning capabilities.

An idea of the syntax could be:

dvc create catalog --name catalog_name --path /path/to/catalog/folder

# Different options could be possible for the subset:

dvc catalog subset subset_name --from-catalog catalog_name --re <regex command to filter images>
dvc catalog subset subset_name --from-catalog catalog_name --rand 0.3 # a number between 0 and 1 to select a sample of images from the catalog
dvc catalog subset subset_name --from-catalog catalog_name --list <a list of images from the catalog>
dvc catalog subset subset_name --from-catalog catalog_name --ui-select # launches a hypothetical UI to select images in the catalog

Now, what would be even better to complement this use case (though I am aware it is way more complicated) would be a command such as “dvc catalog view”, which would open an image viewer to take a look at the images in the catalog, and a command “dvc catalog-subset view” which would do the same for the subset.

What are your opinions on this use case? Would it be compatible with dvc’s concepts? I believe there are a lot of data scientists working on image datasets who would be very interested in such a feature, as they lack useful tools for this situation.


Hi Louis,

We at DVC have been working on a solution to implement this functionality for some time now.

Would you be able to participate in our user research on this? It would entail giving you mockup system commands and seeing whether they map well onto your use case.

Hi Daniel,

I would be glad to participate in your user research. Feel free to contact me on this forum or at my email address if you have access to it.