Best practices for collaborating with DVC

Hello all!

I am new to dvc and I am trying to figure out how to fit it into our workflows.

The first use case that I am trying to figure out is collaborating on building a dataset, particularly with trainees or collaborators of unknown skill. (Imagine we are all working on labelling new data).

What are best practices for reviewing the trainees' work before allowing them to do a dvc push?

Currently we use the GitLab UI to require merge-request reviews for all code changes, which lets me visually inspect every change before allowing a merge. What is the equivalent for DVC?

Thanks for any input!


Hi @jshube

So you are reviewing every label? And may I ask what these labels look like? Are they stored in the same data files or separately, what format is used, and are they tracked by Git or DVC?

Thanks

Thanks for your reply.

I may not review every label. Perhaps closer review to start, then tapering off as trust increases. My primary concern isn't label quality; it's preventing accidental data tampering/deletion and general repository mucking.

I would like to see a summary of what a contributor has done before allowing them to perform a dvc push. It doesn't need to be the content of the files; maybe just the names of changed files and directories.

To answer your other questions: These labels are in the metaheader image format (a medical image format), and they are in their own data files that are currently all being tracked by DVC.


I see. dvc push is a remote storage operation, and DVC doesn't implement any kind of access control for it. That's left to the storage provider (e.g. IAM profiles for Amazon S3 storage). In general this means users either can always push (and possibly gc), or they never can, so it's not really linked to the Git flow / review process.

There are ways to set things up to work around this, like having a CI task that pushes data automatically after a pull request is merged to master; granular user permissions if using SSH storage; or setting up mirrored storage locations (one for pushing, one for pulling), etc. It's up to you to engineer this solution for now, but we welcome feature requests!
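As a rough illustration of the CI idea, here is a hypothetical GitLab CI job that pushes data only after a merge to the default branch. The job name, image, and branch name are placeholders; the AWS credentials would live in protected CI/CD variables that ordinary collaborators cannot read:

```yaml
# .gitlab-ci.yml (sketch, not an official recipe)
push-data:
  image: python:3.11
  rules:
    - if: $CI_COMMIT_BRANCH == "main"   # runs only after merge to main
  script:
    - pip install "dvc[s3]"
    # AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY come from protected
    # CI variables, so only CI can write to the storage.
    - dvc push
```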

We do have a command for this: dvc diff :slightly_smiling_face:


Hi @jshube! Really good use case. I think DVC + CML (to generate a report about changes) should be used for this.

I think my primary concern is preventing accidental data tampering / deletion and general repository mucking.

Since files in DVC are stored in a content-addressable way (a hash, such as MD5, is used as the file name), DVC storage is immutable by default. Unless you run dvc gc -c, or someone goes and edits a file manually in the DVC remote, you are safe by definition. Also, since we git commit the .dvc and other DVC files, the history is immutable too; Git guarantees this.
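To make the content-addressing idea concrete, here is a minimal Python sketch (not DVC's actual implementation, but it mirrors the cache layout convention of a 2-character directory prefix plus the rest of the digest):

```python
import hashlib
import os

def cache_path(content: bytes, cache_dir: str) -> str:
    """Sketch of DVC-style content addressing: the MD5 digest of the
    file's contents becomes its name in the cache, split into a
    2-character directory prefix plus the remainder."""
    digest = hashlib.md5(content).hexdigest()
    return os.path.join(cache_dir, digest[:2], digest[2:])

# Different contents map to different cache paths (barring an MD5
# collision), so an "edit" creates a new object instead of
# overwriting the old one: the store is effectively append-only.
v1 = cache_path(b"label: benign\n", ".dvc/cache")
v2 = cache_path(b"label: malignant\n", ".dvc/cache")
```

This is why editing a tracked file can never clobber the previous version in the remote: the new content hashes to a new object name.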

So, the simplest way to keep data 100% safe is:

  1. Allow reads and the creation of new files, but disable edits and deletions. No one should have access to do those. This means that if someone edits a file in the repo, a new version is created instead, and you will always have access to the previous one through the Git history.

  2. Do not allow rewriting Git history (force pushes, for example).
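For point 1, on S3 this can be expressed as a bucket policy that denies deletions outright. A rough sketch, assuming a hypothetical bucket name `my-dvc-remote` (adapt the principal and resource ARNs to your setup):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyObjectDeletion",
      "Effect": "Deny",
      "Principal": "*",
      "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
      "Resource": "arn:aws:s3:::my-dvc-remote/*"
    }
  ]
}
```

An explicit Deny in S3 overrides any Allow, so even a user with broad permissions cannot remove cached objects.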

Doesnt need to be the content of the files, maybe just changed file names and changed directories.

So, this is where CML can be handy. dvc diff can already generate a list of changes. It's not complete yet, but we'll be improving it within this ticket soon: diff: clean up output for changed files · Issue #2982 · iterative/dvc · GitHub

The idea with CML is that you run some "action" on every git push, and it can help you generate a nice-looking report: you can print dvc diff to see changes to the data, beyond what you already see as changes to the .dvc and dvc.lock files.
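A hypothetical GitLab CI job along those lines might look like this (the job name and image tag are illustrative; `cml comment create` posts the report back to the merge request):

```yaml
# .gitlab-ci.yml (sketch): publish a data-change report on every push
data-report:
  image: iterativeai/cml:0-dvc2-base1
  script:
    - echo "## Data changes" > report.md
    - dvc diff main >> report.md       # compare this branch against main
    - cml comment create report.md     # needs a repo token in CI variables
```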

I hope that answers some of your questions. We would be happy to help you set up CML to do this if you need a hand. Let us know.


Thank you both for your very informative replies!

Based on your feedback I will proceed with something like this:

  1. Have two s3 buckets. One for pulling and one for pushing.
  2. Allow collaborators to only push to the push bucket
  3. Restrict writing to the pull bucket to only a special IAM role
  4. When a collaborator has new labelled data, they create a merge request in GitLab
  5. CI pipeline runs and uses CML to produce a report of the changes in whatever format works for me
  6. I look at the report and if it looks good, proceed with the CI pipeline
  7. Final stage of the CI pipeline uses the special IAM role to dvc push the new data to the pull bucket
  8. Everyone does a dvc pull to get the new data
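The two-bucket layout in steps 1-3 can be sketched with DVC's remote configuration (bucket names here are hypothetical):

```shell
# Default remote: everyone pulls reviewed data from here
dvc remote add -d pull s3://team-dvc-pull

# Staging remote: collaborators push candidate data here
dvc remote add push s3://team-dvc-push

# A contributor uploads new labels to the staging bucket:
dvc push -r push

# After review, CI (running under the privileged IAM role)
# promotes the data to the pull bucket:
dvc push -r pull
```

Since this is remote configuration rather than a runnable program, treat it as a template to adapt, not a tested script.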

I am excited to try out some cool stuff with CML


@jshube two buckets + CI sounds excellent to me! :heart:

Regarding CML and reports - let’s stay in touch and please definitely ping us if you need any help!