How can I check the remote storage location of a file?
You can use dvc get --show-url from the command line and dvc.api.get_url from the Python API to get the remote URL for files. Note that dvc get --show-url will only show the URL for your default remote, but the path for other remotes will be the same. So if dvc get --show-url path/to/repo path/to/file returns something like:
s3://models-remote/c8/d307aa005d6974a8525550956d5fb3
you can replace the first part of the URL with the base URL of another remote as needed. For example, the same file would also be stored at:
s3://data-remote/c8/d307aa005d6974a8525550956d5fb3
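The substitution described above can be sketched in a few lines of Python. This is only an illustration: the helper name swap_remote_base is made up for this example, and the base URLs are the ones from the sample output above.

```python
def swap_remote_base(url: str, old_base: str, new_base: str) -> str:
    """Return url with old_base replaced by new_base, keeping the object path.

    Hypothetical helper for illustration; it is not part of DVC.
    """
    old_base = old_base.rstrip("/")
    new_base = new_base.rstrip("/")
    # Guard against prefixes that only look similar (e.g. "models-remote-2").
    if url != old_base and not url.startswith(old_base + "/"):
        raise ValueError(f"{url!r} is not under {old_base!r}")
    return new_base + url[len(old_base):]


# URL reported for the default remote in the example above.
default_url = "s3://models-remote/c8/d307aa005d6974a8525550956d5fb3"
other_url = swap_remote_base(default_url, "s3://models-remote", "s3://data-remote")
print(other_url)  # s3://data-remote/c8/d307aa005d6974a8525550956d5fb3
```

dvc.api.get_url also takes a remote argument, which may let you skip the string substitution entirely; check the API reference for your DVC version.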
Please refer to the docs for more details:
Since each file is saved in two locations, I am assuming the dvc file is connected to only one location, but I have been unable to confirm which one.
The .dvc file does not associate data with any specific remote by default. Data can be pushed to and pulled from any remote, but commands like dvc push and dvc pull will use whichever remote you have configured as the default unless you name a different one with the -r/--remote flag.
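If you script your pushes and pulls, the effect of the -r flag can be sketched as below. The function name dvc_transfer_command is hypothetical, and the remote and target names are placeholders; the sketch only builds the command line, so you can inspect it before passing it to something like subprocess.run.

```python
import shlex


def dvc_transfer_command(action, target=None, remote=None):
    """Build a dvc push/pull command line (illustrative helper, not part of DVC)."""
    if action not in ("push", "pull"):
        raise ValueError(f"unsupported action: {action!r}")
    cmd = ["dvc", action]
    if remote is not None:
        # Without -r, DVC falls back to the configured default remote.
        cmd += ["-r", remote]
    if target is not None:
        cmd.append(target)
    return cmd


# Uses the default remote.
print(shlex.join(dvc_transfer_command("pull")))
# Explicitly targets one remote for one .dvc file (placeholder names).
print(shlex.join(dvc_transfer_command("push", "model.dvc", remote="models-remote")))
```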
If you would like to specify a particular remote that should be used when pushing/pulling a particular file, you can manually set the remote field in the .dvc file. So you could have something like:
model.dvc:

    outs:
    - md5: a304afb96060aad90176268345e10355
      path: mymodel
      remote: models-remote

data.dvc:

    outs:
    - md5: a304afb96060aad90176268345e10355
      path: mydata
      remote: data-remote
How can I properly clean up my remote storage so each file is stored only once in one of the two remote storage locations?
Regarding cleaning up your remote storage, dvc gc -c is what you would normally use to clean up a remote, but in this case it may be easier for you to just manually remove the files you don’t want in each specific remote.
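If you go the manual route, you need the object path of each unwanted copy. Here is a minimal sketch, assuming the layout seen in the example URL earlier (first two hex characters of the md5 as a directory, the rest as the file name). The name object_url is hypothetical, the sketch does not handle directory hashes, and some DVC versions may lay objects out differently, so compare the result against real dvc get --show-url output before deleting anything.

```python
def object_url(remote_base: str, md5: str) -> str:
    """Map an md5 hash to its assumed location in a remote (illustrative only)."""
    md5 = md5.lower()
    if len(md5) != 32 or any(c not in "0123456789abcdef" for c in md5):
        raise ValueError(f"not a plain md5 hash: {md5!r}")
    # Assumed layout: <remote>/<first 2 chars>/<remaining 30 chars>
    return f"{remote_base.rstrip('/')}/{md5[:2]}/{md5[2:]}"


# Hash taken from the example URL; data-remote is the copy you might remove.
candidate = object_url("s3://data-remote", "c8d307aa005d6974a8525550956d5fb3")
print(candidate)  # s3://data-remote/c8/d307aa005d6974a8525550956d5fb3
```

The sketch only prints the candidate path; the deletion itself is left to your storage provider's own tools.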