"dvc add -external S3://mybucket/data.csv" is failing with access error even after giving correct remote cache configurations

Hi All,

I'm connecting to a remote S3 store for data and also setting the remote DVC cache in the same S3 store.
Here is the config file:

[core]
    remote = s3remote
[cache]
    s3 = s3cache
['remote "s3remote"']
    url = S3://dvc-example
    endpointurl = http://localhost:9000/
    access_key_id = user
    secret_access_key = password
    use_ssl = false
['remote "s3cache"']
    url = s3://dvc-example/cache
    endpointurl = http://localhost:9000/
    access_key_id = user
    secret_access_key = password
    use_ssl = false
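
For reference, the cache part of this config can also be set up from the command line with dvc remote add / dvc remote modify / dvc config (the values below are simply copied from the file above):

$ dvc remote add s3cache s3://dvc-example/cache
$ dvc remote modify s3cache endpointurl http://localhost:9000/
$ dvc remote modify s3cache access_key_id user
$ dvc remote modify s3cache secret_access_key password
$ dvc remote modify s3cache use_ssl false
$ dvc config cache.s3 s3cache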

I am able to push and pull between the remote repository and local storage.
But when I try to add external data using the configured cache, I get an error.
Both s3cache and s3remote use the same credentials, so why does it fail when I add external data with DVC?

Any help would be much appreciated.

Also posted on Stack Overflow under the same title.

Can you share the output of dvc doctor and the full traceback of the failing command when run with -v (e.g. dvc add -v --external s3://<blah>)?

Hi @isidentical ,

  • Here is the output of dvc doctor:

    $ dvc doctor
    DVC version: 2.0.17 (exe)
    Platform: Python 3.8.8 on Windows-10-10.0.17763-SP0
    Supports: All remotes
    Cache types: https://error.dvc.org/no-dvc-cache
    Caches: local, s3
    Remotes: s3
    Workspace directory: NTFS on C:
    Repo: dvc, git

  • Here are the logs.
Traceback (most recent call last):
  File "dvc\main.py", line 55, in main
  File "dvc\command\add.py", line 21, in run
  File "dvc\repo\__init__.py", line 49, in wrapper
  File "dvc\repo\scm_context.py", line 14, in run
  File "dvc\repo\add.py", line 123, in add
  File "dvc\repo\add.py", line 195, in _process_stages
  File "dvc\stage\__init__.py", line 437, in save
  File "dvc\stage\__init__.py", line 451, in save_outs
  File "dvc\output\base.py", line 280, in save
  File "dvc\output\base.py", line 208, in exists
  File "dvc\fs\s3.py", line 233, in exists
  File "dvc\fs\s3.py", line 263, in isfile
  File "dvc\fs\s3.py", line 284, in _list_paths
  File "boto3\resources\collection.py", line 83, in __iter__
  File "boto3\resources\collection.py", line 166, in pages
  File "botocore\paginate.py", line 255, in __iter__
  File "botocore\paginate.py", line 332, in _make_request
  File "botocore\client.py", line 357, in _api_call
  File "botocore\client.py", line 676, in _make_api_call
botocore.exceptions.ClientError: An error occurred (InvalidAccessKeyId) when calling the ListObjects operation: The AWS Access Key Id you provided does not exist in our records.

Hi @veeresh, I think there's some confusion here:

You are using s3://dataset/ in your add --external command. I think DVC tries to connect to a bucket called dataset using your default AWS credentials (e.g. in ~/.aws/, as configured for the AWS CLI). The bucket probably exists but belongs to someone else, and either you don't have default AWS credentials or they don't work with that bucket.

To use your configured S3 remote as target, use the following format:

$ dvc add --external remote://s3remote/path/to/dataset

Please let us know if that helps. Meanwhile, I'll update https://dvc.org/doc/user-guide/managing-external-data to clarify this.

BTW, you may be interested in another — usually better — way to add external data. See https://dvc.org/doc/command-reference/add#example-transfer-to-the-cache. Thanks
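
For reference, that pattern looks roughly like this (the URL and file name are just the illustrative ones from the DVC docs, not from your setup); note that it does place a copy of the data in the project's local cache and workspace:

$ dvc add https://data.dvc.org/get-started/data.xml -o data.xml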

Hi @jorgeorpinel ,

It's the same error even when I pass the full dataset path.
I'm passing the data path along with the bucket name, like below, and it's still the same issue/error.

  • dvc add -v --external s3://dataset/wine-quality.csv
    Does the above command use the locally configured AWS S3 credentials rather than the DVC remote configuration?

My config:

[core]
    remote = s3remote
[cache]
    s3 = s3cache
['remote "s3remote"']
    url = S3://dataset
    endpointurl = http://{XYZ}:9000/
    access_key_id = user
    secret_access_key = password
    use_ssl = false
['remote "s3cache"']
    url = s3://dataset/cache/
    endpointurl = http://{XYZ}:9000/
    access_key_id = user
    secret_access_key = password
    use_ssl = false

This is the Minio bucket structure I have:

PS: I'm connecting to a remotely hosted Minio server.

FYI, remote push and pull work fine, but external add does not work with the same cache and remote configuration.

@jorgeorpinel ,

The use case I'm trying to implement is to have DVC track an external data store (S3) remotely, without storing/downloading the datasets locally.
The link you mentioned (https://dvc.org/doc/command-reference/add#example-transfer-to-the-cache) stores the data on the local system, right?
Please let me know if you have any suggestions for this use case other than add --external.

PS: I read Example: Transfer to remote storage (--to-remote), but even there I see the data (from an https URL) is pushed/added to the remote and tracked.

Yep, it seems you’re still trying to connect to some S3 bucket called dataset on AWS (not in your Minio server). Please try the format I mentioned, here updated with the full path:

$ dvc add --external remote://s3remote/dataset/wine-quality.csv

Everything else looks good in your setup.

The other option is add --to-remote, as you discovered (also available with import-url). This is a form of bootstrapping your repo with some external data that you don't want locally now, but that some other system with a clone of the project will be able to actually download and process in that environment.
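
A rough sketch of that flow, reusing this thread's remote name and file path (whether import-url accepts remote:// URLs exactly like add --external does here is an assumption on my part):

# Transfer the data straight to the default DVC remote and start tracking it,
# without keeping a copy in the local workspace:
$ dvc import-url remote://s3remote/dataset/wine-quality.csv wine-quality.csv --to-remote

# Later, on another clone of the repo, download it only when it's needed:
$ dvc pull wine-quality.csv.dvc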

I think that add --external (using an external cache) is the only method currently available that ensures the data never reaches the “local” environment (on any machine with a repo clone). Note that a copy of the data may still be created in the external cache if the Minio/S3 file system doesn't support reflinks. And if it supports symlinks or hardlinks instead, those need to be configured explicitly in the project before using add --external (see Large Dataset Optimization).
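
If you do need to set link types explicitly, that's done with the cache.type option, along these lines (a sketch; see the Large Dataset Optimization guide for which link types actually make sense for your cache and file system):

$ dvc config cache.type reflink,symlink,hardlink,copy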

Thanks @jorgeorpinel, it worked :slight_smile:
