Best practices for collaborating with DVC

Hello all!

I am new to dvc and I am trying to figure out how to fit it into our workflows.

The first use case that I am trying to figure out is collaborating on building a dataset, particularly with trainees or collaborators of unknown skill. (Imagine we are all working on labelling new data).

What are best practices for reviewing the trainees' work before allowing them to do a dvc push?

Currently we use the GitLab UI to require merge-request reviews for all code changes, which lets me visually inspect every change before allowing a merge. What is the equivalent for DVC?

Thanks for any input!


Hi @jshube

So you are reviewing every label? And may I ask what these labels look like? Are they stored in the same data files or separately, what format is used, and are they tracked by Git or DVC?

Thanks

Thanks for your reply.

I may not review every label. Perhaps closer review to start, then tapering off as trust increases. My primary concern isn't label quality; it's preventing accidental data tampering/deletion and general repository mucking.

I would like to see a summary of what a contributor has done before allowing them to perform a dvc push. It doesn't need to be the content of the files; maybe just the names of changed files and directories.

To answer your other questions: These labels are in the metaheader image format (a medical image format), and they are in their own data files that are currently all being tracked by DVC.


I see. dvc push is a remote storage operation, and DVC doesn't implement any kind of access control for it. That's left to the storage provider (e.g. IAM profiles for Amazon S3 storage). In general this means users either can always push (and possibly gc), or they never can, so it's not really linked to the Git flow / review process.

There are ways to set things up to work around this, like having a CI task that pushes data automatically after a pull request is merged to master; granular user permissions if using SSH storage; or setting up mirrored storage locations (one for pushing, one for pulling), etc. It's up to you to engineer this solution for now, but we welcome feature requests!
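As a rough illustration of the CI idea, here is a hypothetical GitLab CI job that pushes data only after a merge to the default branch. The job name, image, and branch name are placeholders; the AWS credentials would live in protected CI/CD variables that ordinary collaborators cannot read:

```yaml
# .gitlab-ci.yml (sketch, not an official recipe)
push-data:
  image: python:3.11
  rules:
    - if: $CI_COMMIT_BRANCH == "main"   # runs only after merge to main
  script:
    - pip install "dvc[s3]"
    # AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY come from protected
    # CI variables, so only CI can write to the storage.
    - dvc push
```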

We do have a command for this: dvc diff :slightly_smiling_face:


Hi @jshube! Really good use case. I think DVC + CML (to generate a report about changes) should be used for this.

I think my primary concern is preventing accidental data tampering / deletion and general repository mucking.

Since files in DVC are stored in a content-addressable way (a hash, such as MD5, is used as the file name), DVC storage is immutable by default. Unless you run dvc gc -c, or someone goes and edits a file manually in the DVC remote, you are safe by definition. Also, since we git commit the .dvc and other DVC files, the history is immutable too; Git guarantees this.
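To make the content-addressing idea concrete, here is a minimal Python sketch (not DVC's actual implementation, but it mirrors the cache layout convention of a 2-character directory prefix plus the rest of the digest):

```python
import hashlib
import os

def cache_path(content: bytes, cache_dir: str) -> str:
    """Sketch of DVC-style content addressing: the MD5 digest of the
    file's contents becomes its name in the cache, split into a
    2-character directory prefix plus the remainder."""
    digest = hashlib.md5(content).hexdigest()
    return os.path.join(cache_dir, digest[:2], digest[2:])

# Different contents map to different cache paths (barring an MD5
# collision), so an "edit" creates a new object instead of
# overwriting the old one: the store is effectively append-only.
v1 = cache_path(b"label: benign\n", ".dvc/cache")
v2 = cache_path(b"label: malignant\n", ".dvc/cache")
```

This is why editing a tracked file can never clobber the previous version in the remote: the new content hashes to a new object name.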

So, the simplest way to keep data 100% safe is:

  1. Allow reads and the creation of new files, but disable edits and deletions. No one should have access to do those. This means that if someone edits a file in the repo, a new version is created instead, and you will always have access to the previous one through the Git history.

  2. Do not allow rewriting Git history (force pushes, for example).
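For point 1, on S3 this can be expressed as a bucket policy that denies deletions outright. A rough sketch, assuming a hypothetical bucket name `my-dvc-remote` (adapt the principal and resource ARNs to your setup):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyObjectDeletion",
      "Effect": "Deny",
      "Principal": "*",
      "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
      "Resource": "arn:aws:s3:::my-dvc-remote/*"
    }
  ]
}
```

An explicit Deny in S3 overrides any Allow, so even a user with broad permissions cannot remove cached objects.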

Doesnt need to be the content of the files, maybe just changed file names and changed directories.

So, this is where CML can be handy. dvc diff can already generate a list of changes. It's not complete yet, but we'll be improving it within this ticket soon: diff: clean up output for changed files · Issue #2982 · iterative/dvc · GitHub

The idea with CML is that you run some "action" on every git push, and it can help you generate a nice-looking report: you can print dvc diff to see changes to the data, beyond what you already see as changes to the .dvc and dvc.lock files.
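A hypothetical GitLab CI job along those lines might look like this (the job name and image tag are illustrative; `cml comment create` posts the report back to the merge request):

```yaml
# .gitlab-ci.yml (sketch): publish a data-change report on every push
data-report:
  image: iterativeai/cml:0-dvc2-base1
  script:
    - echo "## Data changes" > report.md
    - dvc diff main >> report.md       # compare this branch against main
    - cml comment create report.md     # needs a repo token in CI variables
```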

I hope that answers some of your questions. We would be happy to help you set up CML to do this if you need a hand. Let us know.


Thank you both for your very informative replies!

Based on your feedback I will proceed with something like this:

  1. Have two s3 buckets. One for pulling and one for pushing.
  2. Allow collaborators to only push to the push bucket
  3. Restrict writing to the pull bucket to only a special IAM role
  4. When a collaborator has new labelled data, they create a merge request in GitLab
  5. CI pipeline runs and uses CML to produce a report of the changes in whatever format works for me
  6. I look at the report and if it looks good, proceed with the CI pipeline
  7. Final stage of the CI pipeline uses the special IAM role to dvc push the new data to the pull bucket
  8. Everyone does a dvc pull to get the new data
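The two-bucket layout in steps 1-3 can be sketched with DVC's remote configuration (bucket names here are hypothetical):

```shell
# Default remote: everyone pulls reviewed data from here
dvc remote add -d pull s3://team-dvc-pull

# Staging remote: collaborators push candidate data here
dvc remote add push s3://team-dvc-push

# A contributor uploads new labels to the staging bucket:
dvc push -r push

# After review, CI (running under the privileged IAM role)
# promotes the data to the pull bucket:
dvc push -r pull
```

Since this is remote configuration rather than a runnable program, treat it as a template to adapt, not a tested script.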

I am excited to try out some cool stuff with CML


@jshube two buckets + CI sounds excellent to me! :heart:

Regarding CML and reports - let’s stay in touch and please definitely ping us if you need any help!