DVC full s3 data

Hi guys,
I am currently evaluating DVC for my team.
I have been trying it out and I like some of its features. However, there are some things I am a bit lost with.
We work on s3, we have a daily pipeline that downloads data from the internet, processes and spits out data in another format.
So far I managed to: configure a remote S3 repo, push a local file to it, and play with run and repro. However, this does not fit 80% of our workflow. Can you help me understand whether the following things are possible:

  1. To push data to DVC, do I necessarily need to have the data on my local machine and then push it to the remotely defined location? Can’t I directly track a remote S3 parquet file? e.g.: dvc add s3://mybucket/folder-a/myfile.parquet -r myRemote?
  2. If 1 is possible, and someone manually overwrites s3://mybucket/folder-a/myfile.parquet, how will that affect the integrity of DVC?

Thanks for the help.

Hi @Luc !

Thanks for trying out dvc!

  1. To push data to DVC, do I necessarily need to have the data on my local machine and then push it to the remotely defined location? Can’t I directly track a remote S3 parquet file? e.g.: dvc add s3://mybucket/folder-a/myfile.parquet -r myRemote?

You can use our external outputs feature for that. If you don’t want to actually cache that S3 file with DVC, but do want to incorporate it into your pipeline so that the pipeline is triggered when the file changes, you can use the external dependencies feature, as simply as:

dvc run -d s3://mybucket/folder-a/myfile.parquet -o someout ... ./myscript
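To make that concrete for a daily download-and-process pipeline like yours, here is a rough sketch (process.py and the output path are hypothetical placeholders, not part of your setup):

```shell
# Hypothetical daily stage: the raw parquet on S3 is an external
# dependency, so `dvc repro` re-runs this stage only when the S3
# object changes (DVC compares its ETag).
dvc run \
    -d s3://mybucket/folder-a/myfile.parquet \
    -d process.py \
    -o processed/output.parquet \
    python process.py s3://mybucket/folder-a/myfile.parquet processed/output.parquet
```

Note that in this variant the S3 file itself is not versioned by DVC; it only acts as a trigger for the stage.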

But if you do want to cache it, you can set a remote to use as the cache for S3 outputs like so:

dvc config cache.s3 myremote

and then just

dvc add s3://mybucket/folder-a/myfile.parquet

DVC will then cache your s3://mybucket/folder-a/myfile.parquet file in myremote, so you will be able to dvc checkout it to any version on S3 itself; no downloading/uploading between S3 and your local machine will be done. Obviously, if you are working with colleagues, you’d each need your own workspace on S3 so you don’t bump into each other (but the remote itself can be the same, so all of your cache is in one place). Let me know if you are interested in this scenario, so I can share some tips and tricks :wink:
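Putting the pieces above together, the whole caching setup could look roughly like this (myremote, the bucket, and the cache path are placeholders; adjust to your layout):

```shell
# Hypothetical end-to-end setup for caching an S3-tracked file on S3 itself.
dvc remote add myremote s3://mybucket/dvc-cache    # remote that will hold the cache
dvc config cache.s3 myremote                       # cache S3 outputs in that remote
dvc add s3://mybucket/folder-a/myfile.parquet      # track the file; creates a .dvc file

# Commit the .dvc file and config so colleagues get the same tracking info.
git add myfile.parquet.dvc .dvc/config
git commit -m "Track parquet file on S3"

# Later, restore an older version of the file directly on S3:
git checkout <older-commit> -- myfile.parquet.dvc
dvc checkout myfile.parquet.dvc
```

The key point is that the copy between the cache and the tracked location happens entirely within S3.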

  2. If 1 is possible, and someone manually overwrites s3://mybucket/folder-a/myfile.parquet, how will that affect the integrity of DVC?

If someone overwrites it, DVC will notice that the ETag of that remote file has changed and will re-run that DVC-file’s stage on dvc repro. It will also show the file as changed in the dvc status output.
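To illustrate the mechanism: for a simple (non-multipart) S3 upload, the ETag is the MD5 of the object body, so any overwrite that changes the content changes the checksum DVC has recorded. A minimal sketch of that idea (a simplification — multipart uploads compute ETags differently):

```python
import hashlib

def etag_of(content: bytes) -> str:
    # For a non-multipart S3 upload, the ETag is the MD5 of the object body.
    return hashlib.md5(content).hexdigest()

original = etag_of(b"col_a,col_b\n1,2\n")
overwritten = etag_of(b"col_a,col_b\n1,2\n3,4\n")

# Any change to the object body yields a different ETag, which is what
# `dvc status` and `dvc repro` pick up on.
print(original != overwritten)  # True
```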
