Access remote data instead of downloading it

Hello,

I would like to version a 20 TB dataset that continues to grow in an S3 bucket.

Some colleagues and I will run experiments using this data on our PCs. Is there a way to access the remote data (using boto3 or another option) instead of downloading it with dvc pull?
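
To make it concrete, what we are hoping for is something along these lines. This is just a sketch; the repo URL and file path below are made up for illustration:

import dvc.api

# Stream a single annotation file straight from the remote (S3) without
# pulling the whole dataset; repo URL and path are placeholders.
with dvc.api.open(
    "data/annotations/sample_0001.json",
    repo="https://github.com/our-org/our-dataset-repo",
    mode="r",
) as f:
    print(f.read()[:200])

Is something like this (or an equivalent with boto3) a reasonable way to work, or is dvc pull the only supported workflow?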

Hi @ecram, I have some questions to help determine what would be best.

  1. How is the data generated? Do you generate it locally and dvc push to s3, or is it updated directly in s3?
  2. Since you have 20 TB, will you be running experiments on all of it or some subset? If you will use a subset, how do you select the subset?
  3. Are files ever modified or deleted, or are they only added?
  4. Do you have versioning enabled on your s3 bucket?

Hi dberenbaum,

Please find my answers below each question.

  1. How is the data generated? Do you generate it locally and dvc push to s3, or is it updated directly in s3?
     We collect it and then upload it to our S3 bucket.
  2. Since you have 20 TB, will you be running experiments on all of it or some subset? If you will use a subset, how do you select the subset?
     Most of the time we run experiments using the full 20 TB of data.
  3. Are files ever modified or deleted, or are they only added?
     We mainly add new files, but we have a few rare cases that require us to update files (for example, correcting an annotated value).
  4. Do you have versioning enabled on your s3 bucket?
     No, we only have the raw data.

You mention that the annotations might change. What is the structure of your data? Do you have separate annotations and raw data (like images or text)? What format are the annotations and how do they refer to the raw data? I’m wondering whether it is worthwhile to version all 20 TB or only the annotations, especially if the raw data is never modified or deleted.

Also, are you able to enable versioning on your S3 bucket? If you are unfamiliar with it, take a look at Using versioning in S3 buckets - Amazon Simple Storage Service. That way, S3 would automatically version all the data for you.
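
If it helps, versioning can also be turned on programmatically. Here is a minimal boto3 sketch; the bucket name is a placeholder:

import boto3

s3 = boto3.client("s3")

# Enable object versioning on the bucket (bucket name is a placeholder).
s3.put_bucket_versioning(
    Bucket="my-dataset-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# Check that it took effect; should print "Enabled".
print(s3.get_bucket_versioning(Bucket="my-dataset-bucket").get("Status"))

Note that old object versions count toward storage, so costs can grow as data is overwritten.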


That’s a good idea, thank you for your support.

Currently I am trying to add only the annotations (small files that add up to ~4 GB), but dvc add takes more than an hour to complete. Is this normal?

It’s not unusual for DVC to be slow with many small files (we are working on improving this), but an hour for 4 GB sounds longer than expected. Can you follow up with this info:

  • How many small files do you have?
  • How are you adding them and where are they stored?
  • Can you share the output of dvc doctor?

Sure:

  • I have 1,689,965 files
  • I add them using dvc add /PATH/TO/DIR; all the annotation files are inside DIR
  • Here is the output of dvc doctor:

DVC version: 2.45.1 (pip)

Platform: Python 3.8.16 on Linux-4.15.0-132-generic-x86_64-with-glibc2.27
Subprojects:
dvc_data = 0.40.3
dvc_objects = 0.19.3
dvc_render = 0.2.0
dvc_task = 0.1.11
dvclive = 2.1.0
scmrepo = 0.1.11
Supports:
http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
s3 (s3fs = 2023.1.0, boto3 = 1.24.96)
Cache types: reflink, hardlink, symlink
Cache directory: btrfs on /dev/nvme0n1p3
Caches: local
Remotes: s3
Workspace directory: btrfs on /dev/nvme0n1p3
Repo: dvc, git

I would like to add that dvc push takes too long for only 4 GB of data: it took 16 hours, whereas with boto3 I can upload 900 GB in the same amount of time. Have you tried to improve this by using something like this: amazon web services - How can I increase my AWS s3 upload speed when using boto3? - Stack Overflow?
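
For reference, what I mean is roughly this use of boto3's TransferConfig; the bucket, key, local path, and tuning numbers are placeholders I picked for illustration:

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Placeholder tuning: switch to multipart uploads above 8 MB and let up to
# 20 threads upload parts concurrently.
config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,
    multipart_chunksize=8 * 1024 * 1024,
    max_concurrency=20,
    use_threads=True,
)

s3.upload_file(
    "annotations/file_0001.json",   # placeholder local path
    "my-dataset-bucket",            # placeholder bucket
    "annotations/file_0001.json",   # placeholder key
    Config=config,
)

Could DVC use something like this internally for dvc push?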

@ecram Wait, it took 16 h to upload that? That’s not normal; we are probably leaking some resources somewhere before actually calling boto. The 1.6M files are probably the cause; we need to take a closer look.

Hi @kupruser,

Indeed, and it failed to upload a few files; the message said that put_object failed. I had a similar issue when using boto3, and I solved it using the approach from the link I posted above (TransferManager).

Please let me know if I can provide any additional information.