Interesting ideas, Pawel, about using a stage with no dependencies, or an external dependency in a 2-stage pipeline. But if the goal is to download the data with `dvc pull`, like in the other repos @nimrand has, that wouldn’t help, as `dvc repro` would be needed instead.
The problem is that for `dvc pull` to work, the data would need to be pushed first: added manually to the workspace, tracked by DVC (with `dvc add`, for example), and uploaded to some remote storage with `dvc push`. Another way to put it in the workspace and track it in a single step would be `dvc import-url`. But both of these methods end up storing a second copy of the data set in remote storage.
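
To make the two options concrete, here is a rough sketch of the command sequences, not a tested recipe for this project. The function names are just for illustration, any path or URL you pass in is hypothetical, and it assumes a default DVC remote is already configured for `dvc push`. Commands are only printed unless you set `DRY_RUN=0`.

```shell
# Sketch only: paths and URLs passed to these functions are hypothetical.
# Commands are printed by default; set DRY_RUN=0 to really execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Commit the .dvc file and the .gitignore entry that DVC writes next to the data.
commit_pointer() {
    run git add "$1.dvc" "$(dirname "$1")/.gitignore" &&
    run git commit -m "Track $1 with DVC"
}

# Option A: the data was already copied into the workspace by hand.
add_and_push() {
    run dvc add "$1" && commit_pointer "$1" && run dvc push
}

# Option B: download from a URL and start tracking it in one step.
import_and_push() {
    run dvc import-url "$1" "$2" && commit_pointer "$2" && run dvc push
}
```

For example, `add_and_push data/dataset.csv` covers the manual route, and `import_and_push s3://example-bucket/raw/dataset.csv data/dataset.csv` the single-step one. After either, a fresh clone can get the data with `dvc pull`, but the data set now lives both in its original location and in the DVC remote.
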
Another possible workaround is to add the data to the project without moving it, as an external output. This implies setting up an external cache in the S3 location first, and then doing something like `dvc add s3://s3_data_path`. I’m not sure what happens on S3 at that point though. The data may be duplicated anyway, and `dvc pull` still wouldn’t download it to the workspace, as it was added externally (never pushed in the first place).
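
For reference, the external-output setup would look roughly like the sketch below. The remote name `s3cache` and the function name are placeholders I made up, and the two S3 URLs you pass in are hypothetical. I'm also assuming a DVC release where `dvc add` needs an explicit `--external` flag for this; support for external outputs differs between versions, so check the docs for the one you run. Commands are only printed unless you set `DRY_RUN=0`.

```shell
# Sketch only: the remote name and S3 URLs are placeholders.
# Commands are printed by default; set DRY_RUN=0 to really execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# $1 = S3 location to use as the external cache
# $2 = S3 location of the existing data set
track_in_place() {
    # Register the cache location as a named remote and use it as the S3 cache.
    run dvc remote add s3cache "$1" &&
    run dvc config cache.s3 s3cache &&
    # Track the data where it already is (flag assumed, see note above).
    run dvc add --external "$2"
}
```

Something like `track_in_place s3://example-bucket/dvc-cache s3://example-bucket/raw-data` would print the three commands. Nothing here uploads to a DVC remote, which is why `dvc pull` has nothing to fetch afterwards.
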