I would like to version a 20 TB dataset, stored in an S3 bucket, that keeps growing.
Some colleagues and I will run experiments using this data on our PCs. Is there a way to access the remote data (using boto3 or another option) instead of downloading it with dvc pull?
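A rough sketch of the kind of access I have in mind, assuming something like dvc.api.open can stream a tracked file directly from the S3 remote without materializing the whole dataset locally (the repo URL, revision, and file path below are just placeholders):

```python
import dvc.api

# Stream one tracked file straight from the DVC remote (S3) instead of `dvc pull`.
# The repo URL, revision, and path are placeholders for illustration.
with dvc.api.open(
    "data/annotations.json",
    repo="https://github.com/our-org/our-dataset-repo",
    rev="main",   # any Git revision: branch, tag, or commit
    mode="r",
) as f:
    for line in f:
        ...  # process the record without downloading the full 20 TB
```

If that (or something equivalent) is supported, each experiment would only download what it actually reads.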
How is the data generated? Do you generate it locally and dvc push it to S3, or is it updated directly in S3?
We collect it and then upload it to our S3 bucket.
Since you have 20 TB, will you be running experiments on all of it or on some subset? If you will use a subset, how do you select it?
Most of the time we run experiments using the full 20 TB of data.
Are files ever modified or deleted, or are they only added?
We mainly add new files, but there are a few rare cases where we need to update files (for example, to correct an annotated value).
Do you have versioning enabled on your S3 bucket?
No, we only have the raw data.
You mention that the annotations might change. What is the structure of your data? Do you have separate annotations and raw data (like images or text)? What format are the annotations and how do they refer to the raw data? I’m wondering whether it is worthwhile to version all 20 TB or only the annotations, especially if the raw data is never modified or deleted.
It’s not unusual for DVC to be slow with many small files (we are working on improving it), but an hour for 4 GB sounds longer than expected. Can you follow up with this info:
How many small files do you have?
How are you adding them and where are they stored?
@ecram Wait, it took 16 hours to upload that? That’s not normal; we are probably leaking some resources somewhere before actually using boto. The 1.6M files are probably the cause, so we need to take a closer look.
Indeed, and it failed to upload a few files; the message said that put_object failed. I had a similar issue when using boto3, and I solved it using the link I posted above (TransferManager).
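For reference, this is roughly what I mean by switching to the transfer manager: boto3's upload_file goes through the managed transfer layer, which splits large files into multipart uploads and retries failed parts, instead of a single put_object call. The bucket name, paths, and tuning values below are placeholders, not the exact settings I used:

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Managed (multipart, concurrent, retrying) upload configuration.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MB
    multipart_chunksize=64 * 1024 * 1024,  # size of each uploaded part
    max_concurrency=10,                    # parallel part uploads
    use_threads=True,
)

# upload_file uses the transfer manager under the hood, unlike a raw put_object.
s3.upload_file(
    Filename="local/path/to/file.bin",
    Bucket="my-bucket",
    Key="dataset/file.bin",
    Config=config,
)
```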
Please let me know if I can help with any additional information.