Hi @jshube! Really good use case. I think DVC + CML (to generate a report about changes) should be used for this.
I think my primary concern is preventing accidental data tampering / deletion and general repository mucking.
Since files in DVC are stored in a content-addressable way (a hash, such as md5, is used as the file name), DVC storage is immutable by default. Unless you run dvc gc -c or someone goes and edits a file manually in the DVC remote, you are safe by definition. Also, since we commit .dvc and other DVC files to Git, the history is also immutable - Git guarantees this.
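To make the content-addressing point concrete, here is a minimal Python sketch. It is not DVC's actual code: the store helper and the two-character directory split are simplifications assumed for illustration. Each file is named by the md5 of its contents, so an edit produces a new file instead of overwriting the old one:

```python
import hashlib
import tempfile
from pathlib import Path


def store(cache_dir, data):
    """Save `data` under a name derived from its md5 hash (hypothetical helper)."""
    digest = hashlib.md5(data).hexdigest()
    # Simplified layout: first two hex characters as directory, the rest as file name.
    target = Path(cache_dir) / digest[:2] / digest[2:]
    if not target.exists():
        # Only ever create new files; existing content is never rewritten.
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
    return digest


with tempfile.TemporaryDirectory() as cache:
    v1 = store(cache, b"id,label\n1,cat\n")     # original version
    v2 = store(cache, b"id,label\n1,dog\n")     # "edited" version gets a new name
    again = store(cache, b"id,label\n1,cat\n")  # same content maps to the same name
    stored = sorted(p.name for p in Path(cache).rglob("*") if p.is_file())
```

Both versions end up in the cache side by side, which is why a storage policy that permits only reads and new files is enough to keep old data recoverable.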
So, the simplest way to keep data 100% safe is:
Allow reads and the creation of new files, but disable edits and deletions. No one should have access to do that. It means that if someone edits a file in the repo, a new version will be created instead, and you will always have access to the previous one through the Git history.
Do not allow changes to the Git history (force pushes, for example).
Doesn't need to be the content of the files, maybe just changed file names and changed directories.
So, this is where CML can be handy.
dvc diff can already generate a list of changes. It’s not complete, but we’ll be improving it within this ticket soon - https://github.com/iterative/dvc/issues/2982
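As a rough sketch of how that list could be turned into something readable, the Python below formats the output of dvc diff as a short bullet list. Several things here are assumptions rather than confirmed details: that your DVC version supports dvc diff --json (older releases spelled the flag --show-json), that the JSON groups entries under "added", "deleted", "modified" and "renamed" with a "path" field, and the helper names dvc_diff and format_report, which are made up for this example:

```python
import json
import subprocess


def dvc_diff(a_rev="HEAD~1", b_rev="HEAD"):
    """Run `dvc diff --json` between two Git revisions (assumes dvc is on PATH)."""
    out = subprocess.run(
        ["dvc", "diff", "--json", a_rev, b_rev],
        check=True, capture_output=True, text=True,
    ).stdout
    # An empty output is treated as "nothing changed".
    return json.loads(out) if out.strip() else {}


def format_report(diff):
    """Turn a parsed diff (assumed shape, see above) into a plain bullet list."""
    lines = []
    for status in ("added", "deleted", "modified", "renamed"):
        for entry in diff.get(status, []):
            path = entry["path"]
            # Assumed: renamed entries carry both the old and the new path.
            if isinstance(path, dict):
                path = f"{path['old']} -> {path['new']}"
            lines.append(f"- {status}: {path}")
    return "\n".join(lines) if lines else "No data changes."


# Typical use inside a CI job (not executed here):
#   print(format_report(dvc_diff("HEAD~1", "HEAD")))
```

This only lists file names and directories that changed, not file contents, which seems to match what you asked for.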
The idea with CML is that you run some “action” on every git push, and it can help you generate a nice-looking report - you can print
dvc diff to see changes to the data, besides those that you already see as changes to the
I hope that answers some of your questions. We would be happy to help you set up CML to do this if you need a hand. Let us know.