Running DVC on AWS Batch

I’m trying to get a DVC job to run on AWS Batch, but I’m having trouble getting DVC to access the S3 bucket.

This is the script I’m running:

#!/bin/bash

AWS_CREDENTIALS=$(curl http://169.254.170.2$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI)

export AWS_DEFAULT_REGION=us-east-1
export AWS_ACCESS_KEY_ID=$(echo "$AWS_CREDENTIALS" | jq .AccessKeyId -r)
export AWS_SECRET_ACCESS_KEY=$(echo "$AWS_CREDENTIALS" | jq .SecretAccessKey -r)
export AWS_SESSION_TOKEN=$(echo "$AWS_CREDENTIALS" | jq .Token -r)

echo "AWS_ACCESS_KEY_ID=<$AWS_ACCESS_KEY_ID>"
echo "AWS_SECRET_ACCESS_KEY=<$AWS_SECRET_ACCESS_KEY>"
echo "AWS_SECRET_ACCESS_KEY=<$(cat <(echo "$AWS_SECRET_ACCESS_KEY" | head -c 6) <(echo -n "...") <(echo "$AWS_SECRET_ACCESS_KEY" | tail -c 6))>"
echo "AWS_SESSION_TOKEN=<$(cat <(echo "$AWS_SESSION_TOKEN" | head -c 6) <(echo -n "...") <(echo "$AWS_SESSION_TOKEN" | tail -c 6))>"

alias python=python3

aws s3 ls s3://duolingo-dvc/
aws s3 ls s3://duolingo-dvc/det-grade/
aws s3 cp s3://duolingo-dvc/det-grade/00/0e4343c163bd70df0a6f9d81e1b4d2 mycopy.txt

dvc remote modify s3 --local region us-east-1
dvc remote modify s3 --local access_key_id $AWS_ACCESS_KEY_ID
dvc remote modify s3 --local secret_access_key $AWS_SECRET_ACCESS_KEY
dvc remote modify s3 --local session_token $AWS_SESSION_TOKEN

echo "Starting DVC pull"

dvc pull -v scripts/writing/data/$1/responses.train-model.csv
dvc pull -v scripts/writing/data/$1/responses.test.csv
dvc pull -v scripts/writing/data/$1/corrected-responses.pkl

echo "Stopping DVC pull"

dvc repro correct-responses@$1 --force --single-item

All of the AWS CLI commands, which access the same bucket, succeed. However, when the script gets to the dvc pull commands, I get access-denied errors, even though DVC should be using the same credentials.

botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied

Just before that, the log says it is “Preparing to collect status from ‘duolingo-dvc/det-grade’”.

In case it’s relevant, the role’s permissions on the S3 bucket are defined as follows:

data "aws_iam_policy_document" "standard-batch-job-role" {
  # S3 read access to related buckets
  statement {
    actions = [
      "s3:Get*",
      "s3:List*",
    ]
    resources = [
      data.aws_s3_bucket.duolingo-dvc.arn,
      "${data.aws_s3_bucket.duolingo-dvc.arn}/*",
    ]
    effect = "Allow"
  }
}

AWS doesn’t make it easy to copy the full stack trace, but here is a screenshot:

AWS’s documentation for accessing the credentials within the AWS Batch container can be found here. I’m fairly sure I’m pulling these correctly, since otherwise the aws s3 commands would fail, which they do not. I verified this by setting the AWS_* environment variables to incorrect values, and in that case the aws s3 commands do fail.
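For reference, the same credentials lookup can be done from Python (a sketch only; the endpoint and JSON field names are the ones the script above already uses with curl and jq):

import json
import os
import urllib.request

# Query the ECS/Batch container credentials endpoint, as the bash script does with curl.
url = "http://169.254.170.2" + os.environ["AWS_CONTAINER_CREDENTIALS_RELATIVE_URI"]
with urllib.request.urlopen(url) as resp:
    creds = json.load(resp)

# Expose the temporary credentials the same way the script exports them.
os.environ["AWS_ACCESS_KEY_ID"] = creds["AccessKeyId"]
os.environ["AWS_SECRET_ACCESS_KEY"] = creds["SecretAccessKey"]
os.environ["AWS_SESSION_TOKEN"] = creds["Token"]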

From the docs (remote add)
Make sure you have the following permissions enabled:

  • s3:ListBucket
  • s3:GetObject
  • s3:PutObject
  • s3:DeleteObject

This enables the S3 API methods that are performed by DVC (list_objects_v2 or list_objects, head_object, upload_file, download_file, delete_object, copy).

It already has s3:ListBucket and s3:GetObject (via the s3:List* and s3:Get* actions in the policy). It’s failing on list_objects_v2. It should only need s3:PutObject and s3:DeleteObject for a dvc push or anything else that mutates the remote store, which I am not doing.

For the sake of thoroughness, I added DeleteObject and PutObject permissions to the role. As expected, I still get the same error when it tries to call ListObjectsV2.
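For what it’s worth, the read-only calls that DVC performs can also be exercised directly with boto3 using the exported session credentials (a sketch; the bucket and key come from the script above, everything else is illustrative):

import os

import boto3

# Client built from the same temporary credentials the script exports.
s3 = boto3.client(
    "s3",
    region_name="us-east-1",
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    aws_session_token=os.environ["AWS_SESSION_TOKEN"],
)

# ListObjectsV2 is the call that DVC/s3fs reports as AccessDenied.
resp = s3.list_objects_v2(Bucket="duolingo-dvc", Prefix="det-grade/", MaxKeys=5)
print([obj["Key"] for obj in resp.get("Contents", [])])

# HeadObject and a download cover the other read-only operations DVC makes on pull.
s3.head_object(Bucket="duolingo-dvc", Key="det-grade/00/0e4343c163bd70df0a6f9d81e1b4d2")
s3.download_file("duolingo-dvc", "det-grade/00/0e4343c163bd70df0a6f9d81e1b4d2", "mycopy.txt")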

I’ve traced this down to being an issue with s3fs. Even though the access key ID, secret access key, and session token are all being passed to s3fs, that library’s call to ListObjectsV2 fails, while aws s3 ls works fine inside the container. I can even reproduce it with a standalone Python script that calls s3fs.S3FileSystem(key=..., secret=..., token=...).ls('duolingo-dvc'); it fails when run inside the container but works locally. I’m not sure where to look next.

Does it succeed if you also provide the region?

fs = S3FileSystem(client_kwargs={"region_name": "eu-west-1"}, ...)

Good suggestion, but no, it doesn’t help.

Here is my latest script:

import os

import s3fs

print(os.environ["AWS_ACCESS_KEY_ID"])
print(os.environ["AWS_SECRET_ACCESS_KEY"])
print(os.environ["AWS_SESSION_TOKEN"])

print("running with credentials")
fs = s3fs.S3FileSystem(
    key=os.environ["AWS_ACCESS_KEY_ID"],
    secret=os.environ["AWS_SECRET_ACCESS_KEY"],
    token=os.environ["AWS_SESSION_TOKEN"],
    client_kwargs={"region_name": "us-east-1"}
)
print(fs.exists("duolingo-dvc/det-grade/00/0e4343c163bd70df0a6f9d81e1b4d2"))
print(fs.ls("duolingo-dvc/"))
print(fs.ls("duolingo-dvc/det-grade/"))

All the credentials print out, so they are there. But I still get an “access denied” on ListObjectsV2 when running fs.exists.

The AWS CLI commands, using the same credentials, work fine.

I’ve tried to reproduce it, but I can’t.

Could you try to run aws s3api list-objects-v2 --bucket duolingo-dvc?

Does fs.ls fail also?

Have you tried fs = s3fs.S3FileSystem()? That should make s3fs use boto’s default credentials resolution.

Another thing to try is to turn on s3fs debug logs.
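Something along these lines should do it (a sketch using the standard logging module; the botocore logger is included because that is where the actual request is signed and sent):

import logging

import s3fs

# Emit DEBUG output from s3fs and the underlying botocore client to the console.
logging.basicConfig()
logging.getLogger("s3fs").setLevel(logging.DEBUG)
logging.getLogger("botocore").setLevel(logging.DEBUG)

fs = s3fs.S3FileSystem()
print(fs.ls("duolingo-dvc/"))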

aws s3api list-objects-v2 --bucket duolingo-dvc succeeds.

Calling .exists() and .ls() on a default s3fs.S3FileSystem() does not.

Any updates on this? I am currently facing the same issue running dvc pull against an S3 remote in a GitLab CI/CD pipeline. I use the same IAM role locally and in the pipeline, but the error only appears in the CI/CD pipeline.