I am new to dvc and I am trying to figure out how to fit it into our workflows.
The first use case that I am trying to figure out is collaborating on building a dataset, particularly with trainees or collaborators of unknown skill. (Imagine we are all working on labelling new data).
What are best practices for reviewing the trainees' work before allowing them to do a dvc push?
Currently we use the GitLab UI to require merge reviews for all code changes, and this allows me to visually inspect all changes before allowing a merge. What is the equivalent for DVC?
So you are reviewing every label? And may I ask what these labels look like? Are they stored in the same data files or separately, what format is used, and are they tracked by Git or DVC?
I may not review every label. Perhaps closer review to start, then tapering off as trust increases. My primary concern isn't label quality. It is preventing accidental data tampering / deletion and general repository mucking.
I would like to see a summary of what a contributor has done before allowing them to perform a dvc push. It doesn't need to be the content of the files, maybe just changed file names and changed directories.
To answer your other questions: These labels are in the metaheader image format (a medical image format), and they are in their own data files that are currently all being tracked by DVC.
I see. dvc push is a remote storage operation, and DVC doesn’t implement any kind of access control for this. It’s left to the storage provider (e.g. IAM profiles for Amazon S3 storage). In general this means users can either always push (and possibly gc), or they never can, so it’s not really linked to the Git flow / review process.
There are ways to set things up to work around this, like having a CI task that pushes data automatically after a pull request is merged to master; granular user permissions if using SSH storage; or setting up mirrored storage locations (one for pushing, one for pulling), etc. It’s up to you to engineer this solution for now, but we welcome feature requests!
Hi @jshube! Really good use case. I think DVC + CML (to generate a report about changes) should be used for this.
I think my primary concern is preventing accidental data tampering / deletion and general repository mucking.
Since files in DVC are stored in a content-addressable way (a hash such as md5 is used as the file name), DVC storage is immutable by default. Unless you run dvc gc -c or someone goes and edits a file manually in the DVC remote, you are safe by definition. Also, since we git commit the .dvc and other DVC files, the history is also immutable; Git guarantees this.
So, the simplest way to keep data 100% safe is:
Allow reads and the creation of new files, but disable edits and deletions. No one should have access to do that. It means that if someone edits a file in the repo, a new version will be created instead, and you will always have access to the previous one through the Git history.
Do not allow changing git history (like force push, for example).
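To illustrate why content-addressable storage makes an "edit" harmless, here is a toy sketch in Python. It is not DVC's implementation or storage layout; the ContentStore class and the sample label contents are made up for illustration. It only shows the principle that the key is derived from the content, so changed content lands under a new key and the old object stays untouched.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store (hypothetical, not DVC's real code)."""

    def __init__(self):
        self._objects = {}

    def add(self, data: bytes) -> str:
        # The key is the MD5 of the bytes, so identical content maps to one key.
        key = hashlib.md5(data).hexdigest()
        # setdefault never overwrites: existing objects are left as they are.
        self._objects.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self) -> int:
        return len(self._objects)


store = ContentStore()
v1 = store.add(b"label: lesion\n")   # original version of a label file
v2 = store.add(b"label: healthy\n")  # an "edit" is stored under a different key
```

After the second add, both versions are still retrievable by their keys. In the real setup, the old key is the one recorded in the earlier Git commit of the .dvc file, which is why protecting the remote from deletions and Git from force pushes is enough.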
Doesn't need to be the content of the files, maybe just changed file names and changed directories.
The idea with CML is that you run some “action” on every git push, and it can help you generate a nice-looking report. You can print dvc diff to see the changes to the data, in addition to those you already see as changes to the .dvc and dvc.lock files.
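As a rough sketch of what such a report step could look like, the Python below turns the JSON output of dvc diff into a short list of changed paths and touched directories. A few assumptions to check against your DVC version: that dvc diff accepts a --json flag (older releases call it --show-json), and that the output has "added", "modified", "renamed" and "deleted" lists whose entries carry a "path" (a plain string, or an old/new pair for renames). The function names and the example revisions are made up.

```python
import json
import subprocess
from collections import Counter
from pathlib import PurePosixPath

STATUSES = ("added", "modified", "renamed", "deleted")


def get_dvc_diff(a_rev: str, b_rev: str) -> dict:
    """Run `dvc diff` between two Git revisions (assumes the --json flag exists)."""
    result = subprocess.run(
        ["dvc", "diff", "--json", a_rev, b_rev],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout) if result.stdout.strip() else {}


def summarize_diff(diff: dict) -> str:
    """List changed paths by status, then count changes per parent directory."""
    lines = []
    dirs = Counter()
    for status in STATUSES:
        for entry in diff.get(status, []):
            path = entry["path"]
            if isinstance(path, dict):  # assumed shape for renamed entries
                shown, target = f"{path['old']} -> {path['new']}", path["new"]
            else:
                shown, target = path, path
            lines.append(f"{status}: {shown}")
            dirs[str(PurePosixPath(target).parent)] += 1
    if not lines:
        return "No data changes detected."
    lines.append("Directories touched:")
    for directory, count in sorted(dirs.items()):
        lines.append(f"  {directory}: {count} changed path(s)")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical revisions: compare the merge target with the current commit.
    print(summarize_diff(get_dvc_diff("origin/master", "HEAD")))
```

The printed text could then be posted as a merge request comment by the CI job, so the reviewer sees file names and directories without opening the data itself.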
I hope that answers some of your questions. We would be happy to help you set up CML to do this if you need a hand. Let us know.