Python API with remote repo and local cache

Hi,
I am using the Python API to pull a file from a remote S3 bucket. The call looks like this:

with dvc.api.open(
    '<path_to_file>',
    repo='git@github.com:<my_remote_github_repo>',
    rev='<commit_reference>',
) as fd:
    df = pd.read_csv(fd)

This successfully downloads the file; however, I'm trying to speed it up. I would like the API to download the file from S3 only if it doesn't already exist in my local cache. If it's in my local cache, I would like to read the file from there, since that would be much faster.

I realize that I could dvc pull the file from the command line and point my API call at my local repo; however, I'm trying to keep the code path-agnostic so that other users (who share the data repo) can run this script without having to modify paths in the code.

I’ve looked through the forum and the api documentation, but haven’t been able to find a solution. Is this something that is possible, or perhaps a potential feature in the future? It would really speed up my workflow.

Thanks,
Mike


Hi @mcoppola ! It’s a really good question.

I think caching is not a primary use case for dvc.api at the moment. The primary reason is that you may run it outside of a DVC repo. You can run it anywhere, even on a machine that doesn't have enough disk space to hold a copy; it's actually designed to avoid making one.

That said, I realize that caching can be very beneficial. I would suggest opening a feature request here - github.com/iterative/dvc!

In the meantime, I hope one of these workarounds will work in your case:

  1. Make caching effectively part of your code, and check for the file's existence before reading it again. There are some things that should be done right - let me know if you need a hand with this. Overall it should be a few lines of code if I'm not missing anything (see the first sketch after this list).

  2. Use the internal DVC api - there is a dvc.Repo() object (see DVC's tests for some examples). You could initialize it and run any DVC command programmatically. In your case you might run dvc pull on the file if it is part of the project (see the second sketch below).
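
For the first workaround, here is a minimal sketch of what I have in mind. The load_csv helper and the .local_cache directory are made-up names for illustration; keying the cache by rev ensures different data versions don't collide:

import os

import pandas as pd
import dvc.api

def load_csv(path, repo, rev, cache_dir='.local_cache'):
    # Hypothetical helper: read a CSV via dvc.api, keeping a local copy.
    # The cache is keyed by rev so everyone works from the same data version.
    cached = os.path.join(cache_dir, rev, path)
    if os.path.exists(cached):
        # Fast path: reuse the local copy instead of hitting the remote.
        return pd.read_csv(cached)
    with dvc.api.open(path, repo=repo, rev=rev) as fd:
        df = pd.read_csv(fd)
    # Store a copy locally for the next run.
    os.makedirs(os.path.dirname(cached), exist_ok=True)
    df.to_csv(cached, index=False)
    return df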
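
For the second workaround, a rough sketch is below. Note that the internal API is not considered stable, so the exact import path and method signatures may differ between DVC versions:

from dvc.repo import Repo

repo = Repo('<path_to_repo>')  # local clone of the DVC project
# Pulls only what is missing from the local cache.
repo.pull(targets=['<path_to_file>'])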

@mcoppola, if you don't need to use rev and have a cloned repo locally, you can just pass the repo arg as the local path to that repo. This will use the already-existing cache, and only fall back to the remote for files that are missing.

with dvc.api.open(
    '<path_to_file>',
    repo='<path_to_repo>',
) as fd:
    df = pd.read_csv(fd)

If you are already inside the repo, you can even leave out the repo arg.
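
For example, something like this (assuming the script runs from inside the DVC repo):

with dvc.api.open('<path_to_file>') as fd:
    df = pd.read_csv(fd)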

Hi @shcheklein,
Thanks for the prompt reply.

Our use case for DVC is as follows:

  • Files are stored in an S3 remote
  • The DVC repo is remote
  • Our code base that uses the tracked DVC files is a separate repo
  • Multiple users who share the code repo will frequently re-run experiments, so skipping re-downloads of files already on disk would increase productivity

I’ll have a look at the page that you linked and put in a feature request.

Regarding the workarounds:

  1. This is what I'm currently doing. I check whether the file exists in the local folder where the code is running; if it does, I load it from there, and if it doesn't, I pull it using the API from the remote DVC repo and then store it locally. This keeps the code paths consistent from user to user; however, it doesn't provide version control to ensure we're all running experiments on the same data. I'd be interested to see how you would implement it.
  2. I'll have a look at this to see if it would work better for us than Option 1. Could you provide a link to the appropriate test example files?

Thanks again for your help.
Mike

Hi @skshetry,
Great suggestion. This works well for individual usage; unfortunately, I'm trying to avoid pointing the code at a local repo path (since it will be different for each user), and I'd like to make use of rev so that we're all working from the same version of the data. I'll have a look at the Repo class for now and put in a feature request.

Thanks,
Mike