On my team we are experimenting with DVC as an interface to work with trained models and datasets. So far it has been working nicely, but we have hit a major roadblock when it comes to permissions management.
In our ideal scenario, when a teammate gets access to one of our repositories on GitHub, that person would only have access to the data from that particular repository. This permission should be granted automatically by the fact that the person has access to the repository, with no need to modify roles in AWS.
However, this seems challenging to achieve with S3 + DVC. The solution we are considering right now is to have a single S3 bucket holding all our DVC repositories. If we granted all teammates access to the whole bucket, they would have access to all the company’s data, which is less than ideal. So we were considering a “Junior” role, to which we would have to manually grant access to whatever specific S3 folders they need, and a “Senior” role with access to everything.
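For concreteness, the kind of per-folder grant we would have to manage by hand looks roughly like the sketch below; the bucket, prefix, and user names are placeholders, and this is only to illustrate the manual step:

```python
import json

import boto3  # assumes AWS credentials with IAM admin rights are configured

iam = boto3.client("iam")

# Placeholder names -- replace with the real bucket/prefix/user.
BUCKET = "company-dvc-storage"
PREFIX = "repo-a/"  # one DVC remote per repository, under its own prefix
USER = "junior-teammate"

# Read-only access scoped to a single repository's prefix.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{BUCKET}",
            "Condition": {"StringLike": {"s3:prefix": [f"{PREFIX}*"]}},
        },
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/{PREFIX}*",
        },
    ],
}

# The manual step we'd have to repeat for every (user, repository) pair.
iam.put_user_policy(
    UserName=USER,
    PolicyName=f"dvc-read-{PREFIX.rstrip('/')}",
    PolicyDocument=json.dumps(policy),
)
```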
This solution is suboptimal, as we would need to manage permissions separately in GitHub and in S3, which adds overhead.
Is there any other pattern we are missing here? Is there an easy way for a teammate to be automatically given access to ONLY the DVC remote belonging to a GitHub repository they have access to?
Hi @Javi, thank you for asking! This is a very good question, and there are two possible solutions:
Workflow level. You can allow write/push access to the bucket only for bots/virtual users that push data to the bucket through CI/CD, and only after pull requests have been approved in GitHub, while all human users have read/pull access to the bucket. (There is a rough sketch of this split after the next option.)
User level. The way you describe it. We are in the design stage for this feature, and it would be great if you could help us with the design and requirements.
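To make the workflow-level option concrete, here is a minimal boto3 sketch of that read/write split. All the names (bucket, group, CI user, policy names) are assumptions for illustration, not something DVC prescribes:

```python
import json

import boto3

iam = boto3.client("iam")
BUCKET = "company-dvc-storage"  # example bucket name


def make_policy(name, object_actions):
    """Create a managed policy for the DVC bucket and return its ARN."""
    doc = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{BUCKET}",
            },
            {
                "Effect": "Allow",
                "Action": object_actions,
                "Resource": f"arn:aws:s3:::{BUCKET}/*",
            },
        ],
    }
    resp = iam.create_policy(PolicyName=name, PolicyDocument=json.dumps(doc))
    return resp["Policy"]["Arn"]


# Every human user goes into this group: they can `dvc pull` but not push.
read_arn = make_policy("dvc-pull-only", ["s3:GetObject"])
iam.attach_group_policy(GroupName="dvc-users", PolicyArn=read_arn)

# Only the CI bot user (the one your CI/CD workflow runs `dvc push` as,
# after a PR is approved and merged) gets write access.
write_arn = make_policy("dvc-push", ["s3:GetObject", "s3:PutObject"])
iam.attach_user_policy(UserName="dvc-ci-bot", PolicyArn=write_arn)
```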
I’d be happy to help you and go deeper into the solutions. Please let me know if you are open to a chat - just shoot me a quick “Hi” email to my-first-name at iterative.ai
Has there been any update on the user-level permissions for DVC? My company is also evaluating DVC as a data versioning option and would love to see this integrated with S3.
@coleary There is still no built-in support for this, but there are a few workarounds. I haven’t tested any of them, and they will require at least some minimal setup, but I can try to help with feedback and brainstorming a solution.
Would it be possible in your case to assume that there is one remote per repository? And is it a sufficient level of granularity if we assume that anyone who has access to a repo also has access to its remote storage? Or, at least, that we don’t have to manage access on S3 so that one person can read some files in a remote but not others.
Here are the options I have in mind:
Take a look at saml.to, which helps you assume an AWS role based on a configuration file on GitHub. E.g. you can say that people who have access to a repo, or who are part of some team, have access to a specific bucket on the remote.
I like this one better, but it’s a bit more involved. Create a JSON/YAML file in the repo and a CI/CD action that uses Terraform or the AWS CLI to update team permissions whenever this file changes. This way you can explicitly say that users A, B, and C have access now. (See the first sketch after this list.)
An alternative is to have a cron-scheduled GitHub Action that checks the members of some team and updates S3 permissions accordingly. (See the second sketch below.)
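For the JSON/YAML-file option, here is a rough sketch of the sync script that the CI/CD action could run on every change to the file. I’m sketching with boto3 rather than Terraform, and the file layout, bucket, and user names are all made up:

```python
import json

import boto3
import yaml  # pip install pyyaml

iam = boto3.client("iam")
BUCKET = "company-dvc-storage"  # example bucket

# access.yaml is versioned in the repo, e.g.:
#   repo-a/: [alice, bob]
#   repo-b/: [carol]
with open("access.yaml") as f:
    access = yaml.safe_load(f)

for prefix, users in access.items():
    # Read-only policy scoped to this repository's prefix.
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{BUCKET}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{BUCKET}/{prefix}*",
            },
        ],
    }
    for user in users:
        # One inline policy per (user, prefix); re-running keeps it in sync.
        # NOTE: this only grants -- a fuller version would also revoke
        # policies for users removed from the file.
        iam.put_user_policy(
            UserName=user,
            PolicyName=f"dvc-read-{prefix.rstrip('/')}",
            PolicyDocument=json.dumps(policy),
        )
```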
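And for the cron option, a sketch of the sync step itself, using the GitHub members API. The org/team names and the IAM group are hypothetical, and it assumes IAM user names match GitHub logins:

```python
import os

import boto3
import requests  # pip install requests

ORG, TEAM = "my-org", "data-team"  # hypothetical org and team slug
GROUP = "dvc-users"                # IAM group the bucket policy is attached to

# List current members of the GitHub team (ignoring pagination for brevity).
resp = requests.get(
    f"https://api.github.com/orgs/{ORG}/teams/{TEAM}/members",
    headers={
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    },
)
resp.raise_for_status()
gh_members = {m["login"] for m in resp.json()}

# Assumes IAM user names match GitHub logins; in practice you'd probably
# keep a small mapping file in the repo.
iam = boto3.client("iam")
iam_members = {u["UserName"] for u in iam.get_group(GroupName=GROUP)["Users"]}

# Add new team members to the IAM group, drop people who left the team.
for login in gh_members - iam_members:
    iam.add_user_to_group(GroupName=GROUP, UserName=login)
for login in iam_members - gh_members:
    iam.remove_user_from_group(GroupName=GROUP, UserName=login)
```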