DVC full s3 data

Hi guys,
I am currently evaluating DVC for my team.
I have been trying it out and I like some of its features. However, there are some things I am a bit lost with.
We work on s3, we have a daily pipeline that downloads data from the internet, processes and spits out data in another format.
So far I managed to: configure a remote S3 repo, push a local file to it, and play with run and repro. However, this does not fit 80% of our workflow. Can you help me understand whether the following things are possible:

  1. To push data to DVC, do I necessarily need to have the data on my local machine and then push it to the remotely defined location? Can’t I directly track a remote S3 parquet file? e.g.: dvc add s3://mybucket/folder-a/myfile.parquet -r myRemote?
  2. If 1 is possible, and someone manually overwrites s3://mybucket/folder-a/myfile.parquet, how will that affect the integrity of DVC?

Thanks for the help.

Hi @Luc !

Thanks for trying out dvc!

  1. To push data to DVC, do I necessarily need to have the data on my local machine and then push it to the remotely defined location? Can’t I directly track a remote S3 parquet file? e.g.: dvc add s3://mybucket/folder-a/myfile.parquet -r myRemote?

You can use our external outputs feature for that. If you don’t want to actually cache that S3 file with DVC, but do want to incorporate it into your pipeline so that the pipeline is triggered when the file changes, you can use the external dependencies feature, as simply as:

dvc run -d s3://mybucket/folder-a/myfile.parquet -o someout ... ./myscript
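To make that concrete for a daily download-and-process pipeline like yours, here is a rough sketch (process.py and the output path are hypothetical placeholders, not part of your setup):

```shell
# Hypothetical daily stage: the raw parquet on S3 is an external
# dependency, so `dvc repro` re-runs this stage only when the S3
# object changes (DVC compares its ETag).
dvc run \
    -d s3://mybucket/folder-a/myfile.parquet \
    -d process.py \
    -o processed/output.parquet \
    python process.py s3://mybucket/folder-a/myfile.parquet processed/output.parquet
```

Note that in this variant the S3 file itself is not versioned by DVC; it only acts as a trigger for the stage.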

But if you do want to cache it, you can set a remote to use as the cache for S3 outputs like so:

dvc config cache.s3 myremote

and then just

dvc add s3://mybucket/folder-a/myfile.parquet

DVC will then cache your s3://mybucket/folder-a/myfile.parquet file in myremote, so you will be able to dvc checkout it to any version on S3 itself; no downloading/uploading between S3 and your local machine will be done. Obviously, if you are working with colleagues, you’d each need your own workspace on S3 so you don’t bump into each other (but the remote itself can be the same, so all of your cache is in one place). Let me know if you are interested in this scenario, so I can share some tips and tricks :wink:
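Putting the pieces above together, the whole caching setup could look roughly like this (myremote, the bucket, and the cache path are placeholders; adjust to your layout):

```shell
# Hypothetical end-to-end setup for caching an S3-tracked file on S3 itself.
dvc remote add myremote s3://mybucket/dvc-cache    # remote that will hold the cache
dvc config cache.s3 myremote                       # cache S3 outputs in that remote
dvc add s3://mybucket/folder-a/myfile.parquet      # track the file; creates a .dvc file

# Commit the .dvc file and config so colleagues get the same tracking info.
git add myfile.parquet.dvc .dvc/config
git commit -m "Track parquet file on S3"

# Later, restore an older version of the file directly on S3:
git checkout <older-commit> -- myfile.parquet.dvc
dvc checkout myfile.parquet.dvc
```

The key point is that the copy between the cache and the tracked location happens entirely within S3.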

  2. If 1 is possible, and someone manually overwrites s3://mybucket/folder-a/myfile.parquet, how will that affect the integrity of DVC?

If someone overwrites it, DVC will notice that the ETag of that remote file has changed and will re-run that DVC-file’s stage on dvc repro. It will also show the file as changed in the dvc status output.
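To illustrate the mechanism: for a simple (non-multipart) S3 upload, the ETag is the MD5 of the object body, so any overwrite that changes the content changes the checksum DVC has recorded. A minimal sketch of that idea (a simplification — multipart uploads compute ETags differently):

```python
import hashlib

def etag_of(content: bytes) -> str:
    # For a non-multipart S3 upload, the ETag is the MD5 of the object body.
    return hashlib.md5(content).hexdigest()

original = etag_of(b"col_a,col_b\n1,2\n")
overwritten = etag_of(b"col_a,col_b\n1,2\n3,4\n")

# Any change to the object body yields a different ETag, which is what
# `dvc status` and `dvc repro` pick up on.
print(original != overwritten)  # True
```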
