"dvc.api.get_url()" is not working for --external outputs

I have added a file to DVC as an external output so that it is tracked on a remote (external cache and external data storage).

  • The following .dvc file was generated,
    path in github: remoteTrack/wine-quality.csv.dvc
    outs:
    - etag: 5d6f24258e3c50bb01a61194b5401f5d
      size: 264426
      path: remote://s3remote/wine-quality.csv
  • The same call works when the data is not added as external. The .dvc file for the non-external case is,
    path in github: wine-quality.csv.dvc
    outs:
    - md5: 5d6f24258e3c50bb01a61194b5401f5d
      size: 264426
      path: wine-quality.csv


Passing the path as “remote://s3remote/wine-quality.csv” does not work either.
Error: PathMissingError: The path ‘remoteTrack/wine-quality.csv’ does not exist in the target repository neither as a DVC output nor as a Git-tracked file.

What should be the path for remote external data?

Support for external outputs in dvc.api.* is highly experimental, and it seems like it doesn’t work for your use case. If this is a one-off, I’d advise just expanding remote://s3remote/wine-quality.csv to s3://bucket/wine-quality.csv and reading it with boto or a similar library.
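
A minimal sketch of that workaround, assuming boto3 is installed and AWS credentials are already configured. The helper names split_s3_url and read_s3_object are made up for illustration, and s3://bucket/wine-quality.csv is the placeholder URL from above, not a real location:

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URL: {url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def read_s3_object(url):
    """Return the object's bytes. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so split_s3_url works without boto3

    bucket, key = split_s3_url(url)
    response = boto3.client("s3").get_object(Bucket=bucket, Key=key)
    return response["Body"].read()
```

Calling read_s3_object("s3://bucket/wine-quality.csv") would then return the raw CSV bytes, which can be wrapped in io.BytesIO and handed to a CSV reader.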


Okay.
I will try “s3://bucket” instead of the remote:// path,
but does that solve the issue?
The problem I am facing is reading the URL through the DVC API for different versions of the data.
The data is stored correctly and the .dvc files are generated.
I can enter the S3 URL manually and read the file from that location (or download it with boto3 if I know the URL).

Example URLs for different versions of the data in the cache:
s3://datasource-bucket/cache/559/27fce671701990608798ea11403459
s3://datasource-bucket/cache/5d/6f24258e3c50bb01a61194b5401f5d
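
The second URL matches the etag from the .dvc file above, split after its first two characters. Assuming that layout holds for every version (it is inferred from this example, not a documented guarantee), the URL could be rebuilt from the hash in a given revision's .dvc file. The function name cache_url is hypothetical:

```python
def cache_url(cache_prefix, checksum):
    """Build '<prefix>/<first 2 chars>/<remaining chars>' from a hash.

    The layout is an assumption taken from the example URL above.
    """
    if len(checksum) < 3:
        raise ValueError("checksum is too short to split")
    return f"{cache_prefix.rstrip('/')}/{checksum[:2]}/{checksum[2:]}"


print(cache_url("s3://datasource-bucket/cache",
                "5d6f24258e3c50bb01a61194b5401f5d"))
```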

By the way, what is your use case that requires external outputs? It might be simpler to use regular outputs and push/pull. External outputs are an advanced feature, and some parts are still experimental.

The use case I am exploring is tracking data in a remote location without storing or downloading it locally (I will not push or pull the data locally).
Other systems can change the data in the remote, and I want to track those changes from Git and DVC, again without downloading anything.
I am also trying to load different versions of the remote data in a Jupyter notebook using the DVC API (the notebook runs in the cloud).
Apart from the --external option, do all other options and features download the data locally, either into the cache or into the workspace?

Actually, you can use --to-remote, which should sync your data from the remote source to the remote storage without keeping a local copy, and then you can use dvc.api.read() etc. See the documentation for dvc add.
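
Once the file is tracked that way, reading several versions from a notebook could look roughly like this. It is a sketch, assuming the dvc package is installed and the file ends up as a regular output at wine-quality.csv. The helper read_versions and its reader parameter are made up for illustration. By default it calls dvc.api.read(path, repo=..., rev=...):

```python
def read_versions(path, repo, revs, reader=None):
    """Return {rev: file contents} for each Git revision in revs.

    reader defaults to dvc.api.read. Passing another callable with the
    same (path, repo=, rev=) signature lets you try the helper out
    without DVC installed.
    """
    if reader is None:
        import dvc.api  # imported lazily; only needed for real reads

        reader = dvc.api.read
    return {rev: reader(path, repo=repo, rev=rev) for rev in revs}
```

For example, read_versions("wine-quality.csv", "https://github.com/user/project", ["v1", "v2"]) would return the text of the file at each of the two tags. The repository URL and tag names here are placeholders.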
