S3 remote best practices

#1

Does anyone have recommendations about best practices for S3 remote set-up?

I’m specifically interested in thoughts around permissions and the integrity of data in S3. To enable S3 remotes, users would be granted read & write permissions. As long as users stick to the dvc tooling, it appears the data should be relatively safe. However, granting those permissions also allows direct access outside of the dvc tooling. This seems to open up a scenario where the data history could be corrupted.

#2

Hi @meby!

Great question! All data history is stored in your git repo. The DVC remote only stores the data itself, under checksum names, so unless you directly upload a file with an incorrect checksum (which shouldn’t happen with dvc, since it checks that cache files are not corrupted before uploading them) you should be fine.

To help mitigate the risk, you might want to create separate creds for each team member, specifically for use with the dvc s3 remote. Using different S3 buckets for different dvc projects is also a good idea when combined with per-project, per-team-member creds. And, of course, backups might be a good idea as well :slight_smile:

If you are using dvc pipelines too, then you should be able to reproduce all other data from your source data with a single dvc repro command, so that adds an additional level of assurance even if you do lose your data :slight_smile:
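To make the "stored under checksum names" point concrete, here is a minimal Python sketch of content addressing. It is not DVC's actual implementation: the function names and the two-character prefix layout are assumptions for illustration only. The idea is that the key is derived from the bytes, so an object changed outside the tooling no longer matches its own name and the mismatch can be detected.

```python
import hashlib


def object_key(data: bytes) -> str:
    """Derive a storage key from the content itself (hypothetical layout)."""
    digest = hashlib.md5(data).hexdigest()
    # Split the digest into a two-character prefix and the remainder.
    return f"{digest[:2]}/{digest[2:]}"


def is_intact(key: str, data: bytes) -> bool:
    """Return True if the bytes stored under `key` still hash to that key."""
    return object_key(data) == key


original = b"training data v1"
key = object_key(original)

print(is_intact(key, original))           # True: content matches its name
print(is_intact(key, b"tampered bytes"))  # False: content no longer matches
```

Since the git repo records which checksum each tracked file should have, a remote object whose bytes do not match its name can be identified as corrupted rather than silently trusted.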

Thanks,
Ruslan

#3

Hi @kupruser,

Thanks for the quick reply. I like the suggestion of per-project, per-team-member creds.

I also like the ability to reproduce from source data if intermediates are lost. One of my main concerns, of course, is not losing the source data, and also ensuring that we don’t lose the final generated artifacts that would be deployed to production. One thing we’re evaluating is whether we can configure DVC & S3 so that those can safely live in S3, or whether we should rely on external archiving where the majority of users would be granted only read access.

With S3, I believe we could turn on versioning at the bucket level and not grant the s3:DeleteObject permission. This would ensure users could create and update objects but never delete them, and we’d have a version history for all updates if we needed to roll back.
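As a rough sketch of what I have in mind, the Python snippet below builds an IAM policy document that allows listing, reading, and writing but leaves out s3:DeleteObject. The bucket name and the helper name are hypothetical, and the exact set of actions DVC needs should be checked against the DVC docs. Bucket versioning is not part of this policy and would be enabled separately on the bucket.

```python
import json

# Hypothetical bucket name, used only for illustration.
BUCKET = "example-dvc-remote"


def build_no_delete_policy(bucket: str) -> dict:
    """Build an IAM policy that allows list/read/write but not delete."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Sid": "ListRemote",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Read and write apply to objects inside the bucket.
                # s3:DeleteObject is intentionally omitted.
                "Sid": "ReadWriteObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


policy = build_no_delete_policy(BUCKET)
print(json.dumps(policy, indent=2))
```

With a policy along these lines, an overwrite through s3:PutObject on a versioned bucket would create a new version rather than destroy the old one, which is what would make a rollback possible.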

The DVC docs for S3 list s3:DeleteObject as a required permission. Is this necessary for “every day” workflows?

Specifically, I’m trying to understand how this relates to my mental model of git history. Once a specific commit hash is in the repo, it is there forever unless you rewrite the history, which is not an “every day” workflow.

Thanks,
Matt

#4

Hi @meby !

s3:DeleteObject is not required, unless you are using dvc gc (garbage collector), so your proposed way of not granting s3:DeleteObject would definitely work for day-to-day workflow :slight_smile:

Thanks,
Ruslan
