We are currently looking to implement DVC on our stack.
One thing that isn’t clear from the available documentation, however, is how our data compliance team would go about administering and managing our data store using DVC. What kind of user access control is available to us? Or is this based on our existing Git/Bitbucket setup?
We’ve been through the tutorials available to us and we understand how to add data to the repository, branch off, update etc, but what we’re looking for clarity on is how we would centrally manage this for multiple users.
Any advice or guidance would be appreciated.
EDIT: To add clarity to the question, our requirement is for all of our incoming data sources to be handled by a specialist compliance team who will act as gatekeepers to the data and make it available to the data scientists within the firm as needed. The intention is for our DS team to pull data as is needed but not be able to update it without approval from the compliance team. Thanks.
Great question. DVC does not have built-in access control yet. Instead, it relies on the access control models of your code repository and your storage.
- code: use Git/Bitbucket access settings to control who can access a project's meta-information
- storage: use S3/GCP/SSH access settings per project (for example, per S3 bucket that serves as the DVC remote)
If we assume that you use separate Git repos per project (not a mono repo) and a separate data remote per project (S3 bucket for example) then you can give access by managing these two services - the repository and the S3 bucket.
Your compliance team can create a separate repository per data source (or set of sources) with the appropriate repository and data bucket settings, and grant access to the relevant data scientists.
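As a rough illustration of the storage side, here is a minimal sketch of the commands that would point one project at its own data remote. The remote name, host, account, and path are made-up placeholders, and the helper functions are hypothetical. `dvc remote add` writes the remote into `.dvc/config`, which is then committed to Git like any other file, so the repository permissions decide who can change it, while the permissions of the storage account decide who can read or write the data itself.

```python
import shlex
import subprocess
from typing import List


def remote_setup_commands(remote_name: str, remote_url: str) -> List[List[str]]:
    """Return the commands that register a default DVC remote for a project.

    The remote settings land in .dvc/config, which is tracked by Git.
    """
    return [
        ["dvc", "remote", "add", "--default", remote_name, remote_url],
        ["git", "add", ".dvc/config"],
        ["git", "commit", "-m", f"Configure DVC remote {remote_name}"],
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder values: replace with your own host, account, and path.
    run_all(
        remote_setup_commands(
            "storage", "ssh://dvc-user@storage.example.com/data/project-a"
        )
    )
```

With `dry_run=True` the script only prints the commands, which makes it easy to review them before running anything against a real repository.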
Would you mind clarifying a few things? Do all data scientists have more or less the same access rights to the data that was released by the compliance team?
How do you enforce these policies now?
Yeah, the intention is for the data scientists to have the same level of access to read data, create branches and issue pull requests, which the compliance team would then be required to approve/deny after a review.
Currently we don’t have a process like this yet. As it stands, the data scientists just use a shared folder on a server and there’s no version control on the files they’re accessing, so we have no idea if anyone has changed anything, nor any audit trail or metadata to say what is and isn’t in there. Hence why we’re looking at DVC.
Thank you @dmitry, that’s good to know.
Also great to know we can integrate with our existing Bitbucket service. We won’t be using GCP/S3 for storage; we have a storage server, so it’ll be using SSH to connect to that through Bitbucket.
Makes sense @michaelwdownie. There are a few ways to organize this with DVC, but since you are looking for a central place where your data scientists and the compliance team collaborate, I would probably go with one Git/DVC repo containing all the datasets you want to audit and expose.
Each dataset will be represented by a .dvc meta-file in this repository. Data scientists will be using these files as an entry point to the dataset - to pull it to their project, get a link to the actual data, etc. They will be using DVC commands to make an update to this file, which will then have to be committed via PR to the master branch of the central registry. Thus you will have access to the previous versions, you will know who changed what, and you will have a way to block a change to the “master” version of the dataset.
Btw, it does not mean that you need to create a separate Git repo for this. If you are using a mono-repo where you keep all your data science projects, you can place your datasets there as well.
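To make the review step more concrete, here is a minimal sketch of a check that a CI job or a reviewer could run on a pull request to see which dataset meta-files it touches. It is only an illustration under a few assumptions: the function names are hypothetical, the target branch is assumed to be `master`, and dataset meta-files are assumed to be the files whose names end in `.dvc`. The actual blocking of the merge would still come from your Git hosting settings (for example, required approvals on the master branch), not from DVC.

```python
import subprocess
from pathlib import PurePosixPath
from typing import Iterable, List


def changed_paths(base: str = "master") -> List[str]:
    """List files that differ between HEAD and its merge base with `base`."""
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        check=True,
        capture_output=True,
        text=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def dataset_changes(paths: Iterable[str]) -> List[str]:
    """Keep only DVC meta-files, i.e. paths whose file name ends in .dvc.

    The .dvc/ directory itself (e.g. .dvc/config) is not matched.
    """
    return sorted(p for p in paths if PurePosixPath(p).suffix == ".dvc")


if __name__ == "__main__":
    touched = dataset_changes(changed_paths())
    if touched:
        print("Dataset meta-files changed; compliance review needed:")
        for path in touched:
            print(f"  {path}")
    else:
        print("No dataset meta-files changed.")
```

Keeping `dataset_changes` separate from the Git call means the filtering rule can be checked on a plain list of paths, without a repository at hand.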
If it’s all confusing, we can schedule a meeting and help to brainstorm the details of this workflow.
@shcheklein that would be great if you could. Could you PM me and we’ll sort something out?
@michaelwdownie We’d be happy to chat! Please shoot me an email - firstname.lastname@example.org
It would be helpful if you can provide your availability. We are based in San Francisco, CA (PST time zone).
Hey, @michaelwdownie! We are back from the conference and are ready to help in case your team has any questions or issues. Have you tried using a DVC project as a registry yet?
@michaelwdownie if it’s still relevant, please take a look at this article and the new features we introduced to facilitate a Git-like workflow for data. It should be easier to set up and use “data registries” now. Let me know what your thoughts are on this.