Loading DVC Data in AWS SageMaker

Greetings!

I’ve been working on a project on my local computer and versioning my data with DVC, using an S3 bucket as the remote.

The project has now outgrown my machine, and I need the much larger amounts of RAM available on AWS SageMaker.

I’ve linked my GitLab repo to my SageMaker notebook instances, so I can see the folder containing the .dvc files I pushed to GitLab.

However, every SageMaker tutorial I’ve seen says I need to point to the actual file in S3 (see here: https://stackoverflow.com/a/56060184/4691538).

So, can someone tell me how to load a file into SageMaker when that file is versioned with DVC and stored in S3? Cheers!

Hi @Evan

You can get the S3 URL with:

$ dvc get https://gitlab.com/user/project path/to/data --show-url

That will give you something like:

s3://bucket/mydvc/90/104d9e83cfb825cf45507e90aadd27  

This is also supported through the Python API:

from dvc.api import get_url

url = get_url(repo="https://gitlab.com/user/project", path="path/to/data")

Note that you can also use a local path to your Git/DVC repo, e.g. "." or path/to/myrepo, instead of https://gitlab.com/user/project.
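Once you have the URL, most S3 tooling wants the bucket and the key separately. Here is a minimal sketch of that step, assuming boto3 is installed and the notebook's IAM role can read the bucket. The helper names split_s3_url and download_dvc_file are hypothetical, not part of DVC or SageMaker.

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split an s3://bucket/key URL into a (bucket, key) tuple."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URL: {url!r}")
    # The path part starts with "/", which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


def download_dvc_file(url, local_path):
    """Download the object behind a DVC-resolved S3 URL to local_path.

    Assumption: boto3 is available and credentials come from the
    notebook instance's IAM role.
    """
    import boto3  # imported lazily so split_s3_url works without boto3

    bucket, key = split_s3_url(url)
    boto3.client("s3").download_file(bucket, key, local_path)
    return local_path


# The example URL from the reply above.
bucket, key = split_s3_url("s3://bucket/mydvc/90/104d9e83cfb825cf45507e90aadd27")
print(bucket, key)
```

In a notebook you would pass the result of get_url(...) to download_dvc_file and then read the local copy as usual.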

Hello @kupruser

Worked like a charm :smiley:

Thanks a million!
