Compliance - administering data that is under DVC control

michaelwdownie · April 25, 2019, 2:47pm

Hi there

We are currently looking to implement DVC on our stack.

One thing that isn’t clear from the available documentation, however, is how our data compliance team would go about administering and managing our data store using DVC. What kind of user access control is available to us? Or is this based on our existing Git/Bitbucket setup?

We’ve been through the tutorials available to us and we understand how to add data to the repository, branch off, update, etc., but what we’re looking for clarity on is how we would centrally manage this for multiple users.

Any advice or guidance would be appreciated.

EDIT: To add clarity to the question, our requirement is for all of our incoming data sources to be handled by a specialist compliance team who will act as gatekeepers to the data and make it available to the data scientists within the firm as needed. The intention is for our DS team to pull data as is needed but not be able to update it without approval from the compliance team. Thanks.

dmitry · April 26, 2019, 12:24am

Hi!

Great question. DVC does not have any built-in access control features yet; instead, it relies on the access control models of your code repositories and storage.

code: use Git/Bitbucket access settings to control access to a project’s meta-information
storage: use S3/GCP/SSH access settings per project (for example, per S3 bucket that serves as the DVC remote)

If we assume that you use separate Git repos per project (not a mono repo) and a separate data remote per project (S3 bucket for example) then you can give access by managing these two services - the repository and the S3 bucket.

Your compliance team can create a separate repository per data source (or set of sources) with the right repository and data bucket settings and provide it to the right data scientists.
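
To make the two-layer model above concrete, here is a minimal sketch in Python. It is only an illustration of the idea that access is the combination of repository permissions and storage permissions; `ProjectAccess`, its fields, and the group names are hypothetical and are not part of DVC, Git, or Bitbucket.

```python
from dataclasses import dataclass, field


@dataclass
class ProjectAccess:
    """Hypothetical model of one project's permissions; not a DVC object."""
    repo_readers: set = field(default_factory=set)     # can clone the Git/Bitbucket repo
    storage_readers: set = field(default_factory=set)  # can read the DVC remote (bucket or SSH path)
    storage_writers: set = field(default_factory=set)  # can write to the DVC remote


def can_pull_data(access, user):
    # Pulling needs the meta-information (repo) and the files it points to (storage).
    return user in access.repo_readers and user in access.storage_readers


def can_push_data(access, user):
    # Uploading new data versions needs write access to the storage itself.
    return user in access.storage_writers


# Assumed setup: data scientists read, only the compliance team writes.
access = ProjectAccess(
    repo_readers={"compliance", "ds_team"},
    storage_readers={"compliance", "ds_team"},
    storage_writers={"compliance"},
)
print(can_pull_data(access, "ds_team"))  # True
print(can_push_data(access, "ds_team"))  # False
```

In a real setup these sets would live in the repository host and the storage service (bucket policy or SSH account), not in code.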

shcheklein · April 26, 2019, 12:27am

Hi @michaelwdownie!

Would you mind clarifying a few things? Do all data scientists have more or less the same access rights to the data that was released by the compliance team?

How do you enforce these policies now?

michaelwdownie · April 26, 2019, 8:00am

Yeah, the intention is for the data scientists to have the same level of access to read data, create branches and issue pull requests, which the compliance team would then be required to approve/deny after a review.

Currently we don’t have a process like this yet. As it stands, the data scientists just use a shared folder on a server and there’s no version control on the files they’re accessing, so we have no idea if anyone has changed anything, nor any audit trail or metadata to say what is and isn’t in there. Hence why we’re looking at DVC.

michaelwdownie · April 26, 2019, 8:09am

Thank you @dmitry, that’s good to know.

Also great to know we can integrate with our existing Bitbucket service. We won’t be using GCP/S3 for storage; we have a storage server, so it’ll be using SSH to connect to that through Bitbucket.

shcheklein · April 26, 2019, 6:52pm

Makes sense @michaelwdownie. There are a few ways to organize this with DVC, but since you are looking for a central place where your data scientists and the compliance team collaborate, I would probably go with one Git/DVC repo containing all the datasets you want to audit and expose.

Each dataset will be represented by a .dvc meta-file in this repository. Data scientists will use these files as an entry point to the dataset - to pull it into their project, get a link to the actual data, etc. They will use DVC commands to update this file, and the update will then have to be committed via a PR to the master branch of the central registry. Thus you will have access to the previous versions, you will know who changed what, and you will have a way to block a change to the “master” version of the dataset.

Btw, it does not mean that you need to create a separate Git repo for this. If you are using a mono-repo where you keep all your data science projects, you can place your datasets there as well.
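
The review rule described above (a change to a dataset’s .dvc meta-file must be approved before it reaches master) can be sketched roughly as follows. This is a hypothetical check written only to make the rule concrete; the reviewer names and function names are assumptions, and in practice the repository host’s own branch permissions and required reviewers would likely do this job.

```python
from pathlib import PurePosixPath

# Hypothetical members of the compliance team who may approve dataset changes.
COMPLIANCE_REVIEWERS = {"alice", "bob"}


def changed_dataset_files(changed_paths):
    """Return the changed paths that are DVC meta-files (*.dvc)."""
    return sorted(p for p in changed_paths if PurePosixPath(p).suffix == ".dvc")


def pr_may_merge(changed_paths, approvers):
    """Allow a merge unless it touches a .dvc file without compliance approval."""
    if not changed_dataset_files(changed_paths):
        return True  # code-only change, no dataset version is affected
    return bool(COMPLIANCE_REVIEWERS & set(approvers))


# A PR that changes code and one dataset meta-file, approved only by a peer.
paths = ["train.py", "data/customers.csv.dvc"]
print(changed_dataset_files(paths))    # ['data/customers.csv.dvc']
print(pr_may_merge(paths, ["carol"]))  # False
```

Because each dataset version is just a small text file in Git, the usual PR history doubles as the audit trail of who changed which dataset and when.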

If it’s all confusing, we can schedule a meeting and help to brainstorm the details of this workflow.

michaelwdownie · April 29, 2019, 12:29pm

@shcheklein that would be great if you could. Could you PM me and we’ll sort something out?

dmitry · April 29, 2019, 9:30pm

@michaelwdownie We’d be happy to chat! Please shoot me an email - dmitry@iterative.ai

It would be helpful if you can provide your availability. We are based in San Francisco, CA (PST time zone).

shcheklein · May 11, 2019, 1:00am

Hey, @michaelwdownie! We are back from the conference and are ready to help in case your team has any questions or issues. Have you tried using a DVC project as a registry yet?

shcheklein · January 1, 2020, 7:42pm

@michaelwdownie if it’s still relevant, please take a look at this article and the new features we introduced to facilitate a Git-like workflow for data. It should be easier to set up and use “data registries” now. Let me know your thoughts on this.
