Hi All,
I'm connecting to a remote S3 for data and also setting the remote DVC cache in the same S3. This is my config file:
[core]
remote = s3remote
[cache]
s3 = s3cache
['remote "s3remote"']
url = S3://dvc-example
endpointurl = http://localhost:9000/
access_key_id = user
secret_access_key = password
use_ssl = false
['remote "s3cache"']
url = s3://dvc-example/cache
endpointurl = http://localhost:9000/
access_key_id = user
secret_access_key = password
use_ssl = false
I am able to push and pull from the remote repository to local.
But when I try to add external data using the configured cache, I get an error.
Both s3cache and s3remote have the same credentials, so why does it fail when I add external data in DVC?
Any help would be much appreciated.
Also posted on Stack Overflow: "dvc add --external s3://mybucket/data.csv is failing with access error even after giving correct remote cache configurations".
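To rule out a typo, here is a quick sanity check (just a sketch using Python's configparser on a copy of the config above, with the demo values shown in this post; note that configparser keeps the literal single quotes as part of DVC-style section names):

```python
import configparser

# A copy of the .dvc/config from this post (demo credentials, not real secrets).
DVC_CONFIG = """\
[core]
remote = s3remote
[cache]
s3 = s3cache
['remote "s3remote"']
url = s3://dvc-example
endpointurl = http://localhost:9000/
access_key_id = user
secret_access_key = password
use_ssl = false
['remote "s3cache"']
url = s3://dvc-example/cache
endpointurl = http://localhost:9000/
access_key_id = user
secret_access_key = password
use_ssl = false
"""

parser = configparser.ConfigParser()
parser.read_string(DVC_CONFIG)

# configparser treats the surrounding single quotes as part of the section name.
remote = parser["'remote \"s3remote\"'"]
cache = parser["'remote \"s3cache\"'"]

# Confirm both remote sections really do carry identical credentials.
for key in ("access_key_id", "secret_access_key", "endpointurl"):
    assert remote[key] == cache[key], key
print("credentials match")
```

So the two sections are identical; the question is whether the failing command reads them at all.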
Can you share the output of dvc doctor, and the full traceback of the failing command when run with -v (e.g. dvc add -v --external s3://<blah>)?
Hi @isidentical ,
Platform: Python 3.8.8 on Windows-10-10.0.17763-SP0
Supports: All remotes
Cache types: https://error.dvc.org/no-dvc-cache
Caches: local, s3
Remotes: s3
Workspace directory: NTFS on C:
Repo: dvc, git
Traceback (most recent call last):
File "dvc\main.py", line 55, in main
File "dvc\command\add.py", line 21, in run
File "dvc\repo\__init__.py", line 49, in wrapper
File "dvc\repo\scm_context.py", line 14, in run
File "dvc\repo\add.py", line 123, in add
File "dvc\repo\add.py", line 195, in _process_stages
File "dvc\stage\__init__.py", line 437, in save
File "dvc\stage\__init__.py", line 451, in save_outs
File "dvc\output\base.py", line 280, in save
File "dvc\output\base.py", line 208, in exists
File "dvc\fs\s3.py", line 233, in exists
File "dvc\fs\s3.py", line 263, in isfile
File "dvc\fs\s3.py", line 284, in _list_paths
File "boto3\resources\collection.py", line 83, in __iter__
File "boto3\resources\collection.py", line 166, in pages
File "botocore\paginate.py", line 255, in __iter__
File "botocore\paginate.py", line 332, in _make_request
File "botocore\client.py", line 357, in _api_call
File "botocore\client.py", line 676, in _make_api_call
botocore.exceptions.ClientError: An error occurred (InvalidAccessKeyId) when calling the ListObjects operation: The AWS Access Key Id you provided does not exist in our records.
Hi @veeresh, I think there's some confusion here:
You are using s3://dataset/ in your add --external command. I think that DVC tries to connect to a bucket called dataset and to use your default AWS credentials (e.g. in ~/.aws/, as configured for the AWS CLI). Probably the bucket exists but it's someone else's, and either you don't have default AWS creds or they don't work with that bucket.
To use your configured S3 remote as target, use the following format:
$ dvc add --external remote://s3remote/path/to/dataset
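To illustrate why the remote:// form matters, here is a loose sketch (not DVC's actual resolution code, just the idea) of how such a URL maps onto the named remote's section in .dvc/config, so the endpoint and credentials configured there come along for the ride:

```python
from configparser import ConfigParser
from urllib.parse import urlparse

def resolve_remote_url(config_text: str, url: str) -> str:
    """Sketch: expand remote://<name>/<path> against .dvc/config.
    The remote name selects the config section (which also carries
    endpointurl and credentials); the path is appended to its base url."""
    parsed = urlparse(url)            # netloc = remote name, path = sub-path
    parser = ConfigParser()
    parser.read_string(config_text)
    section = "'remote \"%s\"'" % parsed.netloc   # DVC-style section header
    base = parser[section]["url"].rstrip("/")
    return base + parsed.path

# Minimal config with just the relevant section (demo values).
CONFIG = """\
['remote "s3remote"']
url = s3://dvc-example
endpointurl = http://localhost:9000/
"""

print(resolve_remote_url(CONFIG, "remote://s3remote/path/to/dataset"))
# -> s3://dvc-example/path/to/dataset
```

A plain s3:// URL skips this lookup entirely, which is why it falls back to the default AWS credential chain instead of your configured Minio endpoint.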
Please let us know if that helps. In the meantime, I'll update https://dvc.org/doc/user-guide/managing-external-data to clarify this.
BTW, you may be interested in another, usually better, way to add external data. See https://dvc.org/doc/command-reference/add#example-transfer-to-the-cache. Thanks!
Hi @jorgeorpinel,
I get the same error even if I pass the full dataset path.
I'm passing the data path along with the bucket name like below, and it's still the same issue/error:
dvc add -v --external s3://dataset/wine-quality.csv
Does the above command also use the locally configured AWS S3 credentials rather than the DVC remote configuration?
My config:
[core]
remote = s3remote
[cache]
s3 = s3cache
['remote "s3remote"']
url = S3://dataset
endpointurl = http://{XYZ}:9000/
access_key_id = user
secret_access_key = password
use_ssl = false
['remote "s3cache"']
url = s3://dataset/cache/
endpointurl = http://{XYZ}:9000/
access_key_id = user
secret_access_key = password
use_ssl = false
This is the Minio bucket structure I have:
PS: I'm using a remotely hosted Minio server to connect.
FYI, remote pull and push work as follows, but external add does not with the same cache and remote configuration.
@jorgeorpinel,
The use case I'm trying to implement is to have DVC track an external data store (S3) remotely, without storing/downloading datasets locally.
The link you mentioned (https://dvc.org/doc/command-reference/add#example-transfer-to-the-cache) stores data on the local system, right?
Please let me know if you have any suggestions for this use case other than add --external.
PS: I read Example: Transfer to remote storage (--to-remote), but even there I see the data (from an HTTPS URL) is pushed/added to the remote and tracked.
Yep, it seems you're still trying to connect to some S3 bucket called dataset on AWS (not on your Minio server). Please try the format I mentioned, here updated with the full path:
$ dvc add --external remote://s3remote/dataset/wine-quality.csv
Everything else looks good in your setup.
The other option is --to-remote, as you discovered (also available with import-url). This is a form of bootstrapping your repo with some external data that you don't want locally now, but that some other system with a clone of the project will be able to actually download and process in that environment.
I think that add --external (using an external cache) is currently the only method that ensures the data never reaches the "local" environment (on any machine with a repo clone). Note that a copy of the data may still be created in the external cache if the Minio/S3 file system doesn't support reflinks. And if it supports symlinks or hardlinks instead, those need to be configured explicitly in the project before using add --external (see Large Dataset Optimization).
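For reference, a link-type preference like that would be declared in .dvc/config, e.g. (example values only; check the Large Dataset Optimization guide for what fits your setup):

```ini
[cache]
type = symlink,hardlink,copy
```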
Thanks @jorgeorpinel, it worked!