I would like to version a 20 TB dataset, stored in an S3 bucket, that keeps growing.
Some colleagues and I will run experiments using this data on our PCs. Is there a way to access the remote data (using boto3 or another option) instead of downloading it with dvc pull?
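A rough sketch of the kind of access I have in mind, assuming something like dvc.api.open can stream a tracked file directly from the S3 remote without materializing the whole dataset locally (the repo URL, revision, and file path below are just placeholders):

```python
import dvc.api

# Stream one tracked file straight from the DVC remote (S3) instead of `dvc pull`.
# The repo URL, revision, and path are placeholders for illustration.
with dvc.api.open(
    "data/annotations.json",
    repo="https://github.com/our-org/our-dataset-repo",
    rev="main",   # any Git revision: branch, tag, or commit
    mode="r",
) as f:
    for line in f:
        ...  # process the record without downloading the full 20 TB
```

If that (or something equivalent) is supported, each experiment would only download what it actually reads.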
How is the data generated? Do you generate it locally and dvc push it to S3, or is it updated directly in S3?
We collect it and then upload it to our S3 bucket.
Since you have 20 TB, will you be running experiments on all of it or on some subset? If you will use a subset, how do you select it?
Most of the time we run experiments using the full 20 TB of data.
Are files ever modified or deleted, or are they only added?
We mainly add new files, but there are a few rare cases where we need to update files (for example, to correct an annotated value).
Do you have versioning enabled on your S3 bucket?
No, we only have the raw data.
You mention that the annotations might change. What is the structure of your data? Do you have separate annotations and raw data (like images or text)? What format are the annotations and how do they refer to the raw data? I’m wondering whether it is worthwhile to version all 20 TB or only the annotations, especially if the raw data is never modified or deleted.
It’s not unusual for DVC to be slow with many small files (we are working on improving it), but an hour for 4 GB sounds longer than expected. Can you follow up with this info:
How many small files do you have?
How are you adding them and where are they stored?
@ecram Wait, it took 16 hours to upload that? That’s not normal; we are probably leaking some resources somewhere before actually using boto. The 1.6M files are probably the cause, so we need to take a closer look.
Indeed, and it failed to upload a few files; the message said that put_object failed. I had a similar issue when using boto3, and I solved it using the link I posted above (TransferManager).
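For reference, this is roughly what I mean by switching to the transfer manager: boto3's upload_file goes through the managed transfer layer, which splits large files into multipart uploads and retries failed parts, instead of a single put_object call. The bucket name, paths, and tuning values below are placeholders, not the exact settings I used:

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Managed (multipart, concurrent, retrying) upload configuration.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MB
    multipart_chunksize=64 * 1024 * 1024,  # size of each uploaded part
    max_concurrency=10,                    # parallel part uploads
    use_threads=True,
)

# upload_file uses the transfer manager under the hood, unlike a raw put_object.
s3.upload_file(
    Filename="local/path/to/file.bin",
    Bucket="my-bucket",
    Key="dataset/file.bin",
    Config=config,
)
```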
Please let me know if I can help with any additional information.