Remote s3 cache storage with minio

ChKa · January 24, 2023, 11:15am

Hello,
First of all, thank you for your contribution.
I would like to ask whether there is a way to keep the entire cache of a dataset in a remote MinIO bucket, so that it does not appear in my local storage.

I have added my dataset to a remote MinIO bucket:

dvc remote add myminio -d s3://abucket/DVC
dvc remote modify myminio endpointurl 'http://...../'
dvc remote modify myminio access_key_id 'user'
dvc remote modify myminio secret_access_key 'pass'

dvc add data
git add .gitignore data.dvc .dvc/config
git commit -m '-Added remote storage,-Added data'
dvc push
rm -r .dvc/cache

But when I run dvc pull, the cache directory always comes back.

Is there a way to define that my cache folder is my remote?
Also can I use dvc pull on another PC without data.dvc?

Thank you in advance

kupruser · January 24, 2023, 12:40pm

That’s the intended behavior: dvc pull downloads files into the local cache so that you can use them locally.

“Also can I use dvc pull on another PC without data.dvc?”

You need it in one form or another, because it tells DVC what you want to download. Maybe you are looking for dvc get?
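
If the goal is to read a tracked file without filling the local cache, DVC's Python API may also be worth a look. The following is only a minimal sketch: it assumes the dvc Python package is installed and that the Git repository passed as repo contains the .dvc file. The function name read_head and the opener parameter are hypothetical, added here so the logic can be tried without a real remote.

```python
def read_head(path, repo, remote=None, nbytes=64, opener=None):
    """Return the first nbytes of a DVC-tracked file.

    By default the file is opened with dvc.api.open, which streams from
    remote storage where the remote type supports it. The opener argument
    is a hypothetical hook for substituting another callable in tests.
    """
    if opener is None:
        # Imported lazily so this module still loads without DVC installed.
        import dvc.api

        opener = dvc.api.open

    with opener(path, repo=repo, remote=remote, mode="rb") as f:
        return f.read(nbytes)


# Example call (placeholders, not real values):
#   read_head("data/file", repo="<git repo url>", remote="myminio")
```

The repository still has to carry data.dvc (or an equivalent .dvc file), since that is where DVC looks up what to fetch.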

ChKa · January 26, 2023, 8:20am

Hello,

I managed to get something that works for me by using

dvc import-url data_temp data --to-remote

I uploaded the .dvc files to GitHub and now I can share my dataset with other users.

I would like to know a little bit more about working with dvc pull.

First, this is the tree of my data_temp directory:

folder1
folder2
folder3
file

where each folder looks like this:

folder1
    subfolder
        subfolder_a
            file1
            ....
            file20
        subfolder_b
            file1

I tried pulling the data on another computer and it works.
I would like to understand why ‘dvc pull’ works as intended (it downloads everything), but when I want to pull only selected files, I first have to create a directory named ‘data_temp’ before I can run dvc pull data/file.
Why can I not pull a selected directory with dvc pull data/folder1?

Thank you in advance!

dtrifiro · January 28, 2023, 6:09pm

I’m not 100% sure I understand your use case, but if you need to pull individual subdirectories, I’d recommend first creating data_temp and then importing the subfolders one by one. That will make it possible to pull each subfolder:

mkdir data_temp
dvc import-url <url/to/folder1> data_temp/folder1

With this, you will be able to pull the subfolder with dvc pull data_temp/folder1.

For more info, please have a look at the docs.

ChKa · January 30, 2023, 8:12am

Thank you for your reply!
Actually, this is what I am trying to achieve: managing external data.

My use case is that I have a large dataset and limited storage, and I want to share the dataset without creating unnecessary cache for me or my colleagues. I also need to select part of the data (e.g. with a certain tag) or drop garbage, which is why I am trying to pull only certain folders from the external storage. I believe the Examples section describes what I want to achieve. However, I haven’t managed to do it with MinIO.

I tried import-url before, and even though the remote configuration exists and dvc pull works, dvc import-url does not find the credentials:

ERROR: unexpected error - Unable to locate credentials
I also tried

export AWS_ACCESS_KEY_ID=<user_name>
export AWS_SECRET_ACCESS_KEY=<password>
export AWS_S3_ENDPOINT_URL=http://...:port/

which returns:

Forbidden: An error occurred (403) when calling the HeadObject operation: Forbidden

dtrifiro · January 30, 2023, 8:50am

Unfortunately, there is currently no way to give import-url a custom endpoint URL via environment variables (the access key and secret key environment variables do work, though). You could open a feature request on the DVC GitHub repository (iterative/dvc).

You can use the following workaround to run import-url with minio:

dvc remote add minio s3://<bucket>
dvc remote modify minio endpointurl http://<minio host>:<minio port>
dvc remote modify --local minio access_key_id <access key id>
dvc remote modify --local minio secret_access_key <secret access key>

Then you will be able to use the remote:// syntax for import-url with the previously defined remote:

dvc import-url remote://minio/path/on/bucket
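
If the same setup has to be repeated on several machines, the command sequence above could be scripted. The snippet below is only a rough sketch: the function names and the example bucket, endpoint, and credential values are hypothetical placeholders, and by default it prints the commands instead of running them.

```python
import shlex
import subprocess


def minio_import_commands(remote, bucket, endpoint, key_id, secret, src_path):
    """Build the dvc command lines of the workaround as argv lists."""
    return [
        ["dvc", "remote", "add", remote, f"s3://{bucket}"],
        ["dvc", "remote", "modify", remote, "endpointurl", endpoint],
        # --local writes to .dvc/config.local, which is not committed to Git.
        ["dvc", "remote", "modify", "--local", remote, "access_key_id", key_id],
        ["dvc", "remote", "modify", "--local", remote, "secret_access_key", secret],
        ["dvc", "import-url", f"remote://{remote}/{src_path}"],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


# Placeholder values for illustration only.
commands = minio_import_commands(
    remote="minio",
    bucket="my-bucket",
    endpoint="http://localhost:9000",
    key_id="EXAMPLE_KEY_ID",
    secret="EXAMPLE_SECRET",
    src_path="path/on/bucket",
)
run_all(commands)  # dry run: prints the five commands without executing them
```

Calling run_all(commands, dry_run=False) inside an initialized DVC repository would execute the commands in order and stop at the first failure.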
