"dvc add -external S3://mybucket/data.csv" is failing with access error even after giving correct remote cache configurations

veeresh · April 15, 2021, 7:59am

Hi All,

I am connecting to a remote S3 bucket for data, and I have also set up the remote DVC cache in the same S3 storage.
Here is the config file:

[core]
    remote = s3remote
[cache]
    s3 = s3cache
['remote "s3remote"']
    url = S3://dvc-example
    endpointurl = http://localhost:9000/
    access_key_id = user
    secret_access_key = password
    use_ssl = false
['remote "s3cache"']
    url = s3://dvc-example/cache
    endpointurl = http://localhost:9000/
    access_key_id = user
    secret_access_key = password
    use_ssl = false

I am able to push to and pull from the remote.
But when I try to add external data with the cache configured, I get an error.
Both s3cache and s3remote have the same credentials, so why does it fail when I add external data in DVC?

Any help would be much appreciated.

Also posted on Stack Overflow: "dvc add -external S3://mybucket/data.csv" is failing with access error even after giving correct remote cache configurations

isidentical · April 15, 2021, 8:27am

Can you share the output of dvc doctor and the full traceback of the failing command when run with -v (e.g. dvc add -v --external s3://<blah>)?

veeresh · April 15, 2021, 10:36am

Hi @isidentical ,

Here is the output of dvc doctor:

$ dvc doctor
DVC version: 2.0.17 (exe)

Platform: Python 3.8.8 on Windows-10-10.0.17763-SP0
Supports: All remotes
Cache types: https://error.dvc.org/no-dvc-cache
Caches: local, s3
Remotes: s3
Workspace directory: NTFS on C:
Repo: dvc, git

Here are the logs.

Traceback (most recent call last):
  File "dvc\main.py", line 55, in main
  File "dvc\command\add.py", line 21, in run
  File "dvc\repo\__init__.py", line 49, in wrapper
  File "dvc\repo\scm_context.py", line 14, in run
  File "dvc\repo\add.py", line 123, in add
  File "dvc\repo\add.py", line 195, in _process_stages
  File "dvc\stage\__init__.py", line 437, in save
  File "dvc\stage\__init__.py", line 451, in save_outs
  File "dvc\output\base.py", line 280, in save
  File "dvc\output\base.py", line 208, in exists
  File "dvc\fs\s3.py", line 233, in exists
  File "dvc\fs\s3.py", line 263, in isfile
  File "dvc\fs\s3.py", line 284, in _list_paths
  File "boto3\resources\collection.py", line 83, in __iter__
  File "boto3\resources\collection.py", line 166, in pages
  File "botocore\paginate.py", line 255, in __iter__
  File "botocore\paginate.py", line 332, in _make_request
  File "botocore\client.py", line 357, in _api_call
  File "botocore\client.py", line 676, in _make_api_call
botocore.exceptions.ClientError: An error occurred (InvalidAccessKeyId) when calling the ListObjects operation: The AWS Access Key Id you provided does not exist in our records.


jorgeorpinel · April 15, 2021, 5:32pm

Hi @veeresh, I think there’s a confusion here:

You are using s3://dataset/ in your add --external command. I think that DVC tries to connect to a bucket called dataset and to use your default AWS credentials (e.g. in ~/.aws/, as configured for AWS-CLI). Probably the bucket exists but it’s someone else’s, and either you don’t have default AWS creds or they don’t work with that bucket.

To use your configured S3 remote as target, use the following format:

$ dvc add --external remote://s3remote/path/to/dataset

Please let us know if that helps. I’ll update https://dvc.org/doc/user-guide/managing-external-data to clarify on this meanwhile.

BTW, you may be interested in another — usually better — way to add external data. See https://dvc.org/doc/command-reference/add#example-transfer-to-the-cache. Thanks
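
To make the difference between the two target formats concrete, here is a minimal Python sketch. It is not DVC's implementation; it only illustrates the behaviour described in this reply, assuming that a remote:// target is resolved by joining the path onto the named remote's url, and that a plain s3:// target carries none of the remote's settings. The helper names load_remotes and resolve_target are hypothetical, and the config values are the placeholders from the first post.

```python
import configparser
from urllib.parse import urlsplit

# Same layout as the config posted earlier in the thread (placeholder
# values from that post, not real credentials).
CONFIG_TEXT = """
[core]
remote = s3remote
['remote "s3remote"']
url = s3://dvc-example
endpointurl = http://localhost:9000/
access_key_id = user
secret_access_key = password
"""


def load_remotes(text):
    """Return {remote_name: settings} parsed from DVC-style config text."""
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.read_string(text)
    remotes = {}
    for section in parser.sections():
        # Section headers look like: 'remote "s3remote"'
        cleaned = section.strip("'")
        if cleaned.startswith('remote "') and cleaned.endswith('"'):
            name = cleaned[len('remote "'):-1]
            remotes[name] = dict(parser[section])
    return remotes


def resolve_target(target, remotes):
    """Illustrative only: map an `add --external` target to (url, settings)."""
    parts = urlsplit(target)
    if parts.scheme != "remote":
        # Plain s3:// URL: no remote section applies, so the custom
        # endpoint and keys from the config are not attached to it.
        return target, {}
    settings = dict(remotes[parts.netloc])
    base = settings.pop("url").rstrip("/")
    return base + "/" + parts.path.lstrip("/"), settings


remotes = load_remotes(CONFIG_TEXT)
print(resolve_target("remote://s3remote/path/to/dataset", remotes))
print(resolve_target("s3://dataset/wine-quality.csv", remotes))
```

In this sketch the remote:// target comes back with the Minio endpoint and keys attached, while the plain s3:// target comes back with an empty settings dict, which matches the idea that the default AWS credentials are tried instead.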

veeresh · April 16, 2021, 2:25am

Hi @jorgeorpinel ,

I get the same error even if I pass the full dataset path.
I'm passing the data path along with the bucket name as shown below, and the error is still the same.

dvc add -v --external s3://dataset/wine-quality.csv
Does the above command also use the locally configured AWS S3 credentials rather than the DVC remote configuration?

Config:

[core]
    remote = s3remote
[cache]
    s3 = s3cache
['remote "s3remote"']
    url = S3://dataset
    endpointurl = http://{XYZ}:9000/
    access_key_id = user
    secret_access_key = password
    use_ssl = false
['remote "s3cache"']
    url = s3://dataset/cache/
    endpointurl = http://{XYZ}:9000/
    access_key_id = user
    secret_access_key = password
    use_ssl = false

This is my Minio bucket structure I have,

PS: I'm connecting to a remotely hosted Minio server.

veeresh · April 16, 2021, 2:28am

FYI,
Remote pull and push work as follows, but external add does not work with the same cache and remote configuration.

veeresh · April 16, 2021, 2:44am

@jorgeorpinel ,

The use case I am trying to implement is to have DVC track data in an external store (S3) remotely, without storing or downloading the datasets locally.
The link you mentioned (https://dvc.org/doc/command-reference/add#example-transfer-to-the-cache) stores the data on the local system, right?
Please let me know if you have any suggestions for this use case other than add --external.

PS: I read "Example: Transfer to remote storage" (--to-remote), but even there I see that the data (from an HTTPS URL) is pushed/added to the remote and tracked.

jorgeorpinel · April 16, 2021, 2:57am

Yep, it seems you’re still trying to connect to some S3 bucket called dataset on AWS (not in your Minio server). Please try the format I mentioned, here updated with the full path:

$ dvc add --external remote://s3remote/dataset/wine-quality.csv

Everything else looks good in your setup.

jorgeorpinel · April 16, 2021, 3:05am

The other option is add --to-remote, as you discovered (also available with import-url). This is a way of bootstrapping your repo with external data that you don’t want locally right now, but that another system with a clone of the project will be able to download and process in its own environment.

I think that add --external (using an external cache) is the only method currently available that ensures the data never reaches the “local” environment (on any machine with a repo clone). Note that a copy of the data may still be created in the external cache if the Minio/S3 file system doesn’t support reflinks. If it supports symlinks or hardlinks instead, those need to be configured explicitly in the project before using add --external (see Large Dataset Optimization).

veeresh · April 16, 2021, 4:52am

Thanks @jorgeorpinel , It worked
