How do I get the remote url of my local files?

Hi everyone.

I set up two remote storage locations (one for data storage and another for model storage). Examining them today, I found that I had managed to dvc push all my files to both locations. My intention was to specify, for each file, the remote location it should go to, not to have all of them in both.

  1. How can I check the remote storage location of a file? Since each file is saved in two locations, I am assuming the dvc file is connected to only one location, but I have been unable to confirm which one.
  2. How can I properly clean up my remote storage so each file is stored only once in one of the two remote storage locations?

I am new to dvc so would appreciate any input. Thanks!

How can I check the remote storage location of a file?

You can use dvc get --show-url from the command line and dvc.api.get_url from the Python API to get the remote URL for files. Note that dvc get --show-url will only show the URL for your default remote, but the path for other remotes will be the same.

So if dvc get --show-url path/to/repo path/to/file returns something like:

s3://models-remote/c8/d307aa005d6974a8525550956d5fb3

You can replace the first part of the URL with other remotes as needed, i.e. that file would also be stored in:

s3://data-remote/c8/d307aa005d6974a8525550956d5fb3
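The substitution described above can be sketched in Python. This is a minimal illustration, not part of DVC: swap_remote and url_on_remote are hypothetical helper names, and swap_remote assumes each remote's base URL is a bare bucket with no path prefix. The second helper uses the remote argument of dvc.api.get_url to ask DVC for the URL on a named remote directly, which needs DVC installed and a real repo.

```python
from urllib.parse import urlsplit, urlunsplit


def swap_remote(url: str, new_remote_base: str) -> str:
    """Return `url` with its scheme and bucket replaced by those of `new_remote_base`.

    Hypothetical helper: assumes the remote base is a bare bucket/host
    (for example "s3://data-remote") with no extra path prefix.
    """
    old = urlsplit(url)
    new = urlsplit(new_remote_base)
    # Keep the object path (e.g. /c8/d307...), change only scheme and bucket.
    return urlunsplit((new.scheme, new.netloc, old.path, "", ""))


def url_on_remote(path: str, repo: str, remote: str) -> str:
    """Ask DVC for the URL of `path` on a named remote (requires DVC)."""
    import dvc.api  # imported lazily so swap_remote works without DVC installed

    return dvc.api.get_url(path, repo=repo, remote=remote)


print(swap_remote("s3://models-remote/c8/d307aa005d6974a8525550956d5fb3",
                  "s3://data-remote"))
```

If your remotes are configured with a path prefix inside the bucket, calling dvc.api.get_url with remote= is likely the safer route, since DVC then builds the full URL itself.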

Please refer to the docs for dvc get and dvc.api.get_url for more details.

Since each file is saved in two locations, I am assuming the dvc file is connected to only one location, but I have been unable to confirm which one.

The .dvc file does not associate data with any specific remote by default. Data can be pushed to and pulled from any remote, but commands like dvc push and dvc pull will use whichever remote you have configured as the default unless you select a different one with the -r/--remote flag.
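As a rough sketch of what choosing a remote per file with -r looks like when scripted, the snippet below builds the dvc push / dvc pull command lines. The helper names dvc_command and run are hypothetical, and the file names model.dvc and data.dvc and the remote names models-remote and data-remote are illustrative. It defaults to a dry run that only prints the commands.

```python
import subprocess
from typing import List, Optional


def dvc_command(action: str, target: str, remote: Optional[str] = None) -> List[str]:
    """Build a `dvc push` or `dvc pull` command, optionally pinned to one remote."""
    if action not in ("push", "pull"):
        raise ValueError(f"unsupported action: {action}")
    cmd = ["dvc", action]
    if remote is not None:
        # Without -r, DVC falls back to the default remote.
        cmd += ["-r", remote]
    cmd.append(target)
    return cmd


def run(cmd: List[str], dry_run: bool = True) -> None:
    """Print the command, or execute it when dry_run is False."""
    if dry_run:
        print(" ".join(cmd))
    else:
        subprocess.run(cmd, check=True)


# Illustrative targets and remote names; adjust to your own project.
run(dvc_command("push", "model.dvc", remote="models-remote"))
run(dvc_command("push", "data.dvc", remote="data-remote"))
```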

If you would like to specify a particular remote that should be used when pushing/pulling a particular file, you can manually set the remote field in the .dvc file. So you could have something like:

model.dvc:

outs:
  - md5: a304afb96060aad90176268345e10355
    path: mymodel
    remote: models-remote

data.dvc:

outs:
  - md5: a304afb96060aad90176268345e10355
    path: mydata
    remote: data-remote

How can I properly clean up my remote storage so each file is stored only once in one of the two remote storage locations?

Regarding cleaning up your remote storage, dvc gc -c is what you would normally use to clean up a remote (combined with a scope flag such as -w/--workspace, and -r to target a non-default remote), but in this case it may be easier to just manually remove the files you don’t want from each remote.
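If you go the manual route, a small script can help you work out which objects to delete from which remote. The sketch below is only a planning aid under stated assumptions: object_key and keys_to_remove are hypothetical helpers, the key layout (first two hash characters as a directory) is taken from the example URL earlier in this answer and may differ between DVC versions, and directory outputs (whose hashes end in .dir) are not handled. It deletes nothing; it only lists candidate keys, assuming every object currently sits on both remotes.

```python
from typing import Dict, List


def object_key(md5: str) -> str:
    """Key of an object inside a remote: first two hex chars, then the rest."""
    return f"{md5[:2]}/{md5[2:]}"


def keys_to_remove(wanted: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """For each remote, list keys of objects that belong on another remote only.

    `wanted` maps a remote name to the md5 hashes that should stay there.
    """
    plan: Dict[str, List[str]] = {}
    for remote, keep_hashes in wanted.items():
        keep = set(keep_hashes)
        # Hashes assigned to any other remote.
        others = {h for r, hashes in wanted.items() if r != remote for h in hashes}
        plan[remote] = sorted(object_key(h) for h in others - keep)
    return plan


# Hashes reused from the examples in this thread, purely for illustration.
wanted = {
    "models-remote": ["c8d307aa005d6974a8525550956d5fb3"],
    "data-remote": ["a304afb96060aad90176268345e10355"],
}
for remote, keys in keys_to_remove(wanted).items():
    print(remote, keys)
```

Review the printed keys against the actual bucket contents before deleting anything, since removing the wrong object makes that file impossible to pull from that remote.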