Add a remote directory without adding to the cache

jorgeorpinel · July 22, 2020, 4:00pm

Interesting ideas about using a stage with no dependencies, or an external dependency in a 2-stage pipeline, Pawel. But if the goal is to download the data with dvc pull, like in the other repos @nimrand has, that wouldn’t help, as dvc repro is needed instead.

The problem is that for dvc pull to work, the data would need to be pushed first, meaning added manually to the workspace, tracked by DVC (dvc add for example), and uploaded to some remote storage with dvc push. Another way to put it in the workspace and track it in a single step would be dvc import-url. But both of these methods end up storing a second copy of the data set in remote storage.
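
For reference, the two approaches above look roughly like this. It's only a sketch: the remote name `storage`, the `s3://my-bucket/...` paths and the `data/` directory are placeholders I made up, not anything from @nimrand's setup. The commands are wrapped in functions so nothing runs until you call one.

```shell
# Hypothetical names throughout: "storage", s3://my-bucket/..., data/

# Approach 1: the data is already copied into the workspace as data/;
# track it with DVC and push it to remote storage.
add_and_push() {
  dvc remote add -d storage s3://my-bucket/dvc-storage  # default remote for pushes
  dvc add data/   # caches the directory and writes data.dvc
  dvc push        # uploads the cached data to the default remote
}

# Approach 2: download from the original location and track in one step.
import_from_url() {
  dvc import-url s3://my-bucket/raw-data data  # writes data.dvc with the URL as a dependency
  dvc push                                     # still needed before others can dvc pull
}
```

Either way the data ends up both in its original S3 location and in the DVC remote, which is the duplication mentioned above.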

Another possible workaround is to add the data to the project without moving it, as an external output. This implies setting up an external cache in the S3 location first, and then doing something like dvc add s3://s3_data_path. I’m not sure what happens on S3 at that point though: the data may be duplicated anyway, and dvc pull still wouldn’t download it to the workspace, as it’s added externally (never pushed in the first place).
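
A rough sketch of that external-output setup, assuming a DVC version that supports external outputs and the `--external` flag for dvc add (this may not hold in every release, so check the docs for yours). The remote name `s3cache` and the `s3://my-bucket/cache` path are hypothetical; `s3://s3_data_path` is the placeholder from above. Again wrapped in a function so it doesn't run by itself.

```shell
# Assumption: external outputs and `dvc add --external` are available in your DVC version.
# "s3cache" and s3://my-bucket/cache are made-up names for illustration.
add_external_s3() {
  dvc remote add s3cache s3://my-bucket/cache  # S3 location to use as the external cache
  dvc config cache.s3 s3cache                  # use that remote as the cache for s3:// outputs
  dvc add --external s3://s3_data_path         # track the data where it already lives
}
```

Note that nothing here involves dvc push, which is why dvc pull would have nothing to download afterwards.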
