Python API with remote repo and local cache

Hi,
I am using the Python API to pull a file from a remote S3 bucket. The call looks like this:

with dvc.api.open(
    '<path_to_file>',
    repo='git@github.com:<my_remote_github_repo>',
    rev='<commit_reference>',
) as fd:
    df = pd.read_csv(fd)

This successfully downloads the file; however, I'm trying to speed it up. I would like the API to download the file from S3 only if it doesn't already exist in my local cache. If it's in my local cache, I would like to read the file from there, since that would be much faster.

I realize that I could dvc pull the file from the command line and point my API call at my local repo; however, I'm trying to keep the code path-agnostic so that other users (who share the data repo) can run this script without having to modify paths in the code.

I’ve looked through the forum and the api documentation, but haven’t been able to find a solution. Is this something that is possible, or perhaps a potential feature in the future? It would really speed up my workflow.

Thanks,
Mike


Hi @mcoppola ! It’s a really good question.

I think caching is not a primary use case for dvc.api at the moment. The primary reason is that you may run it outside of a DVC repo. You can run it anywhere, even on a machine that doesn't have enough disk space to hold a copy; it's actually designed to avoid making one.

That said, I realize that caching can be very beneficial. I would suggest opening a feature request here - github.com/iterative/dvc!

In the meantime, I hope one of these workarounds will work in your case:

  1. Make caching effectively part of your code, and check for the file's existence before reading it again. There are some things that should be done right - let me know if you need a hand with this. Overall it should be a few lines of code if I'm not missing anything (see the first sketch after this list).

  2. Use the internal DVC api - there is a dvc.Repo() object (see DVC's tests for some examples). You could initialize it and run any DVC command programmatically. In your case you might run dvc pull on the file if it is part of the project (see the second sketch below).
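
For the first workaround, here is a minimal sketch of what I have in mind. The load_csv helper and the .local_cache directory are made-up names for illustration; keying the cache by rev ensures different data versions don't collide:

import os

import pandas as pd
import dvc.api

def load_csv(path, repo, rev, cache_dir='.local_cache'):
    # Hypothetical helper: read a CSV via dvc.api, keeping a local copy.
    # The cache is keyed by rev so everyone works from the same data version.
    cached = os.path.join(cache_dir, rev, path)
    if os.path.exists(cached):
        # Fast path: reuse the local copy instead of hitting the remote.
        return pd.read_csv(cached)
    with dvc.api.open(path, repo=repo, rev=rev) as fd:
        df = pd.read_csv(fd)
    # Store a copy locally for the next run.
    os.makedirs(os.path.dirname(cached), exist_ok=True)
    df.to_csv(cached, index=False)
    return df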
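
For the second workaround, a rough sketch is below. Note that the internal API is not considered stable, so the exact import path and method signatures may differ between DVC versions:

from dvc.repo import Repo

repo = Repo('<path_to_repo>')  # local clone of the DVC project
# Pulls only what is missing from the local cache.
repo.pull(targets=['<path_to_file>'])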

@mcoppola, if you don't need to use rev and have a cloned repo locally, you can just pass the repo arg as the local path to that repo. This will use the already-existing cache, and only fall back to the remote for files that are missing.

with dvc.api.open(
    '<path_to_file>',
    repo='<path_to_repo>',
) as fd:
    df = pd.read_csv(fd)

If you are already inside the repo, you can even leave out the repo arg.
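
For example, something like this (assuming the script runs from inside the DVC repo):

with dvc.api.open('<path_to_file>') as fd:
    df = pd.read_csv(fd)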

Hi @shcheklein,
Thanks for the prompt reply.

Our use case for DVC is as follows:

  • Files are stored in an S3 remote
  • The DVC repo is remote
  • Our code base that uses the tracked DVC files is a separate repo
  • Multiple users who share the code repo will frequently re-run experiments, so skipping re-downloads of files already on disk would increase productivity

I’ll have a look at the page that you linked and put in a feature request.

Regarding the workarounds:

  1. This is what I'm currently doing. I check whether the file exists in the local folder where the code is running; if it does, I load it from there, and if it doesn't, I pull it using the API from the remote DVC repo and then store it locally. This keeps the code paths consistent from user to user; however, it doesn't provide version control to ensure we're all running experiments on the same data. I'd be interested to see how you would implement it.
  2. I'll have a look at this to see if it would work better for us than Option 1. Could you provide a link to the appropriate test example files?

Thanks again for your help.
Mike

Hi @skshetry,
Great suggestion. This works well for individual usage; unfortunately, I'm trying to avoid pointing the code at a local repo path (since it will be different for each user), and I'd like to make use of rev so that we're all working from the same version of the data. I'll have a look at the Repo class for now and put in a feature request.

Thanks,
Mike