This isn't really a big deal. import-url will download the data directly into the cache and then link the file into the workspace (as long as the file system supports it; see this doc for more info), so there's no duplication. -O is meant more for small files you want to track with Git, or experimental files you don't want to track at all.
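To illustrate the "no duplication" point, here is a minimal Python sketch of the cache-then-link idea. It is not DVC's actual code: it only tries a plain hard link and falls back to a copy, whereas DVC chooses among several link types depending on your config and file system. The names link_from_cache and demo, and the paths, are made up for this example.

```python
import os
import shutil
import tempfile


def link_from_cache(cache_path, workspace_path):
    """Place a cached file in the workspace, preferring a hard link over a copy.

    Returns "hardlink" or "copy" depending on what the file system allowed.
    (Hypothetical helper for illustration; not part of DVC.)
    """
    try:
        os.link(cache_path, workspace_path)
        return "hardlink"
    except OSError:
        # Hard links are not possible here, so fall back to a real copy.
        shutil.copyfile(cache_path, workspace_path)
        return "copy"


def demo():
    """Create a fake cache entry, link it into a fake workspace, and report."""
    with tempfile.TemporaryDirectory() as root:
        cache_dir = os.path.join(root, "cache")
        work_dir = os.path.join(root, "workspace")
        os.makedirs(cache_dir)
        os.makedirs(work_dir)

        # Stand-in for a file that was downloaded into the cache.
        cached = os.path.join(cache_dir, "data.bin")
        with open(cached, "wb") as f:
            f.write(b"x" * 1024)

        linked = os.path.join(work_dir, "data.bin")
        method = link_from_cache(cached, linked)

        # With a hard link, both paths point at the same inode,
        # so the data exists on disk only once.
        same_inode = os.stat(cached).st_ino == os.stat(linked).st_ino
        return method, same_inode


if __name__ == "__main__":
    method, same_inode = demo()
    print(method, same_inode)
```

If the link succeeds, both paths share one inode, which is why the workspace copy costs no extra disk space.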
DVC checks the ETag of the original S3 data source whenever it needs to determine this, e.g. on dvc status, dvc repro, etc. (You can use dvc update to bring it up to date, BTW.) But I thought your dataset wasn't expected to ever change anyway?