I would like to set up a data registry with multiple datasets and multiple users (data scientists from different teams). Since some datasets contain sensitive information, access control for both reads and writes is important. I would like the setup to be as flexible as possible. In the simplest case there could be just public and private datasets, but RBAC/ABAC, or the ability to grant a specific user access to a specific dataset directly (a many-to-many access control record), would be much better.
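To make the last option concrete, here is a rough Python sketch of the kind of many-to-many access records I have in mind. All names (`Grant`, `AccessRegistry`, the users and the datasets) are made up for illustration; none of this is a DVC API.

```python
from dataclasses import dataclass, field
from enum import Enum


class Permission(Enum):
    READ = "read"
    WRITE = "write"


@dataclass(frozen=True)
class Grant:
    """One many-to-many access record: a user may do something to a dataset."""
    user: str
    dataset: str
    permission: Permission


@dataclass
class AccessRegistry:
    public_datasets: set = field(default_factory=set)  # readable by anyone
    grants: set = field(default_factory=set)

    def grant(self, user, dataset, permission):
        self.grants.add(Grant(user, dataset, permission))

    def can(self, user, dataset, permission):
        # Public datasets are readable by everyone; writes always need a grant.
        if permission is Permission.READ and dataset in self.public_datasets:
            return True
        return Grant(user, dataset, permission) in self.grants


# Hypothetical users and datasets
registry = AccessRegistry(public_datasets={"open_images"})
registry.grant("alice", "patients", Permission.READ)
registry.grant("alice", "patients", Permission.WRITE)
registry.grant("bob", "patients", Permission.READ)
```

Here `registry.can("bob", "patients", Permission.WRITE)` is false while the same check for READ is true, which is the per-user, per-dataset granularity I am after. Roles or attributes could be layered on top by granting to a group name instead of a user.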
DVC does not provide such features, but there are some workarounds, e.g. here and there. As far as I understand, one could create several remotes to simulate different groups/roles and give each user access to the appropriate ones. To me, such a workaround seems easy to break by accident: with two remotes, ‘public’ and ‘private’, someone with access to both can accidentally push a ‘private’ dataset to the ‘public’ remote, making it available to anybody with access to the ‘public’ remote.
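A toy Python model of the failure mode I am worried about. This is not DVC code; `Remote`, `push`, `can_read` and all the names are hypothetical. The point is that access is checked per remote, while nothing records which remote a dataset is supposed to live in.

```python
from dataclasses import dataclass, field


@dataclass
class Remote:
    name: str
    readers: set
    writers: set
    datasets: set = field(default_factory=set)


def push(user, dataset, remote):
    # The only check is "may this user write to this remote?".
    # Nothing ties the dataset itself to a remote or a sensitivity level.
    if user not in remote.writers:
        raise PermissionError(f"{user} cannot write to {remote.name}")
    remote.datasets.add(dataset)


def can_read(user, dataset, remotes):
    # A user can read a dataset if it sits in any remote they can read.
    return any(user in r.readers and dataset in r.datasets for r in remotes)


public = Remote("public", readers={"alice", "bob"}, writers={"alice", "bob"})
private = Remote("private", readers={"alice"}, writers={"alice"})

push("alice", "patients", private)   # intended
leaked_before = can_read("bob", "patients", [public, private])

push("alice", "patients", public)    # accidental, but nothing forbids it
leaked_after = can_read("bob", "patients", [public, private])
```

After the second push, `bob` can read `patients` even though he was never given access to the ‘private’ remote.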
Alternatively, there could be two separate repos (or DVC registries), one public and one private. That would make accidental policy violations less likely, but creating e.g. 4 or 5 independent access groups (‘private_for_project_A’, ‘private_for_project_B’ and so on) would get messy.
I would be very thankful if anybody could share ideas on how to organize access control in a more reliable and convenient way.