Dataset-level access control for data registry

Hello!
I would like to organize a data registry with multiple datasets and multiple users (data scientists from different teams). Since some datasets contain sensitive information, it is important to have access control for both reads and writes. It would be great to have as flexible a setup as possible. In the simplest case there could be just public and private datasets, but RBAC/ABAC, or the ability to grant a specific user access to a specific dataset directly (a many-to-many access control record), would be much better.
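To make the requirement concrete, here is a minimal sketch of the kind of model I have in mind: public datasets plus many-to-many (user, dataset) grants with separate read and write permissions. Everything in it (`AccessRegistry`, the user names, the dataset names) is hypothetical and for illustration only; none of it is a DVC feature.

```python
from dataclasses import dataclass, field


@dataclass
class AccessRegistry:
    """Hypothetical in-memory access-control table; not part of DVC."""

    # Datasets that anyone may read.
    public: set = field(default_factory=set)
    # (user, dataset) -> set of permissions, e.g. {"read", "write"}.
    grants: dict = field(default_factory=dict)

    def make_public(self, dataset):
        self.public.add(dataset)

    def grant(self, user, dataset, *permissions):
        # One record per (user, dataset) pair: the many-to-many case.
        self.grants.setdefault((user, dataset), set()).update(permissions)

    def allowed(self, user, dataset, permission):
        # Public datasets are readable by anyone; everything else needs a grant.
        if permission == "read" and dataset in self.public:
            return True
        return permission in self.grants.get((user, dataset), set())


registry = AccessRegistry()
registry.make_public("open-images")
registry.grant("alice", "patients", "read", "write")
registry.grant("bob", "patients", "read")
```

With these records, `registry.allowed("bob", "patients", "write")` is false, while any user can read `open-images`.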

DVC does not provide such features, but there are some workarounds, e.g. here and there. As far as I understand, one could create several remotes to simulate different groups/roles and give each user access to the appropriate ones. To me, such a workaround seems pretty easy to break unintentionally: e.g. with two remotes, ‘public’ and ‘private’, someone with appropriate access can accidentally push a ‘private’ dataset to the ‘public’ remote, making it available to anybody with ‘public’ remote access.
There could also be two separate repos (or DVC registries), one public and one private. That would make accidental policy violations less likely, but creating e.g. 4 or 5 independent access groups (‘private_for_project_A’, ‘private_for_project_B’) would be messy.
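One way the accidental-push risk in the multi-remote workaround might be reduced is a small check that runs before any push. The sketch below is only an assumption about how that could look: the `ALLOWED_REMOTES` mapping, the dataset paths, and `check_push` are all hypothetical, and the function does not call DVC itself.

```python
# Hypothetical policy: which remotes each dataset may be pushed to.
ALLOWED_REMOTES = {
    "datasets/open-images": {"public", "private"},
    "datasets/patients": {"private"},
}


def check_push(dataset, remote, policy=ALLOWED_REMOTES):
    """Raise PermissionError if `dataset` must not be pushed to `remote`."""
    # Datasets missing from the policy are denied everywhere by default.
    allowed = policy.get(dataset, set())
    if remote not in allowed:
        raise PermissionError(
            f"{dataset!r} may not be pushed to remote {remote!r}; "
            f"allowed: {sorted(allowed)}"
        )
    return True
```

A wrapper script could call `check_push(...)` and only then run `dvc push -r <remote>`. This guards against mistakes only; it is not real enforcement, since anyone holding credentials for the storage behind a remote can still bypass the check.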

I would be very thankful if anybody could share ideas on how to organize more reliable and convenient access control.

@zimka it is a great question. There are some ways of organizing this - you’ve already provided the links to the workarounds.

Also, we are thinking of creating special ACL features in DVC to provide a more flexible way of doing this. We are currently collecting requirements for this project.

It would be great to set up a chat with you and talk about your ideal scenario and requirements. Please let me know what the best time for a chat this week would be. My email is my name at iterative.ai.

@dmitry I have sent you an email, thank you!