How can I check the remote storage location of a file?
You can use dvc get --show-url from the command line and dvc.api.get_url from the Python API to get the remote URL for files. Note that dvc get --show-url will only show the URL for your default remote, but the path for other remotes will be the same.
So if dvc get --show-url path/to/repo path/to/file returns something like:
s3://models-remote/c8/d307aa005d6974a8525550956d5fb3
You can replace the first part of the URL with the base URL of another remote as needed. For example, if the file has also been pushed to data-remote, it would be stored at:
s3://data-remote/c8/d307aa005d6974a8525550956d5fb3
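As a rough illustration of both approaches, here is a minimal Python sketch. swap_remote_base is a hypothetical helper that does the prefix replacement by hand, and urls_per_remote asks dvc.api.get_url for each remote by name through its remote argument. The helper names and the injectable get_url parameter are my own assumptions for illustration, and the second function needs DVC installed and a real repo to return anything meaningful.

```python
from typing import Callable, Dict, Iterable, Optional


def swap_remote_base(url: str, default_base: str, other_base: str) -> str:
    """Replace the default remote's base URL with another remote's base URL.

    Hypothetical helper: it only rewrites the string and does not check
    whether the object actually exists in the other remote.
    """
    prefix = default_base.rstrip("/") + "/"
    if not url.startswith(prefix):
        raise ValueError(f"{url!r} does not start with {prefix!r}")
    return other_base.rstrip("/") + "/" + url[len(prefix):]


def urls_per_remote(
    path: str,
    repo: str,
    remotes: Iterable[str],
    get_url: Optional[Callable[..., str]] = None,
) -> Dict[str, str]:
    """Return {remote name: storage URL} for one tracked file.

    By default this calls dvc.api.get_url(path, repo=..., remote=...).
    The get_url parameter is an assumption added so the function can be
    exercised without DVC installed.
    """
    if get_url is None:
        from dvc.api import get_url  # imported lazily; requires DVC
    return {name: get_url(path, repo=repo, remote=name) for name in remotes}


if __name__ == "__main__":
    print(
        swap_remote_base(
            "s3://models-remote/c8/d307aa005d6974a8525550956d5fb3",
            "s3://models-remote",
            "s3://data-remote",
        )
    )
    # prints: s3://data-remote/c8/d307aa005d6974a8525550956d5fb3
```

With DVC installed, something like urls_per_remote("path/to/file", "path/to/repo", ["models-remote", "data-remote"]) should give you one URL per remote without any manual string editing.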
Please refer to the docs for more details:
Since each file is saved in two locations, I am assuming the .dvc file is connected to only one location, but I have been unable to confirm which one.
The .dvc file does not associate data with any specific remote by default. Data can be pushed to and pulled from any remote, but commands like dvc push and dvc pull will use whichever remote you have configured as the default unless you name a different one with the -r/--remote flag.
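If you script your pushes and pulls, a small Python wrapper can make the remote choice explicit. This is only a sketch: dvc_sync_command and run_dvc are hypothetical names, and the remote and target names are taken from the examples in this thread. It builds the same command line you would type by hand, adding -r only when a remote is given.

```python
import subprocess
from typing import List, Optional, Sequence


def dvc_sync_command(
    action: str,
    remote: Optional[str] = None,
    targets: Sequence[str] = (),
) -> List[str]:
    """Build a `dvc push` or `dvc pull` command line.

    Without a remote, DVC falls back to the configured default remote.
    """
    if action not in ("push", "pull"):
        raise ValueError("action must be 'push' or 'pull'")
    cmd = ["dvc", action]
    if remote:
        cmd += ["-r", remote]
    cmd += list(targets)
    return cmd


def run_dvc(cmd: List[str]) -> None:
    """Run the command in the current DVC repo; raises if DVC exits non-zero."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only prints the command; call run_dvc(...) inside a real repo to execute it.
    print(" ".join(dvc_sync_command("push", "models-remote", ["model.dvc"])))
    # prints: dvc push -r models-remote model.dvc
```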
If you would like a particular remote to be used when pushing or pulling a particular file, you can manually set the remote field in its .dvc file. So you could have something like:
model.dvc:
outs:
- md5: a304afb96060aad90176268345e10355
  path: mymodel
  remote: models-remote

data.dvc:
outs:
- md5: a304afb96060aad90176268345e10355
  path: mydata
  remote: data-remote
How can I properly clean up my remote storage so each file is stored only once, in one of the two remote storage locations?
Regarding cleaning up your remote storage, dvc gc -c is what you normally use to clean up a remote, but in this case it may be easier for you to just manually remove the files you don’t want in each specific remote.