Restrict access to DVC repo

Does DVC have any support for restricting access to the DVC repo in order to make sure data is not lost / corrupted?

The use case is this: we are a group of data scientists / developers working on the same data. The data is medical data: it is not allowed to leave our premises (i.e. it cannot be stored in the cloud), and it cannot be reproduced or recovered if lost. We do our day-to-day work on a number of machines (3 shared machines plus our workstations). We are developing medical software, so we are required to keep track of each and every change to the data, and we must also be able to reproduce any software build from up to 5 years back.
We currently have the following setup: all code is placed in a Git repository, and within it we have a number of DVC repos. The DVC remote storage for all of them is on a NAS (with off-site backup). The NAS is mounted at the same location on all machines, and the remote storage is accessed via this mount point.
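
For reference, on each machine the remote is configured roughly like this (a sketch; “nas” is just an arbitrary remote name, and the path is the NAS mount point mentioned below):

dvc remote add -d nas /mnt/nas/DVC   # default remote on the NAS mount
dvc push                             # writes straight into that directory
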
The problem is this: since the remote storage is mounted on all machines and everyone needs to be able to dvc push to it, everyone needs write access directly into the DVC remote storage. This is very fragile: if someone accidentally deletes /mnt/nas/DVC, all our data is lost. I guess the problem would be the same even when storing the data in the cloud.
Is there a way to restrict access to the remote storage such that only DVC can read/write the data?
I’m thinking: if DVC were running as a daemon with certain privileges (like the Docker daemon), everything would be fine. Is something like this possible?

Thanks 🙂

We saw similar requirements in medical companies that use DVC.

First, files in the DVC cache are read-only by default. If you use NFS as a local remote, you should get the same read-only files (it actually depends on the NFS mount settings). Unfortunately, this does not protect your data files from being removed.
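
For example, after dvc add the cached copy of a file typically ends up read-only (a sketch; the file name and hash are just illustrative, and the exact cache layout depends on your DVC version):

dvc add data/patients.csv        # hypothetical data file
ls -l .dvc/cache/a3/04afed...    # illustrative content hash
# -r--r--r-- 1 alice data 1048576 ... .dvc/cache/a3/04afed...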

In some file systems (NTFS) and clouds (S3) you can make the cache safer by restricting the directory so that files can be created but not removed. These permissions, together with read-only files, protect the cache. I’m not sure this helps with NFS, though.
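
For S3, here is a sketch of what I mean (bucket name is a placeholder): a bucket policy that denies object deletion for everyone, while reads and uploads stay granted through your usual IAM policies. dvc push only adds new content-addressed objects, so a no-delete policy does not get in its way:

cat > no-delete.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyObjectDeletion",
    "Effect": "Deny",
    "Principal": "*",
    "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
    "Resource": "arn:aws:s3:::my-dvc-remote/*"
  }]
}
EOF
# Note: this also blocks manual cleanup and dvc gc on that bucket;
# narrow the Principal if you need a cleanup path.
aws s3api put-bucket-policy --bucket my-dvc-remote --policy file://no-delete.json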

Second, I’d suggest separating development and production NAS directories and establishing a data productionization process around them. In a simple and naive setup, you might have a dedicated person who is the only one with write access to production (I have seen this in medical companies). A more mature process can be set up through CI/CD (only the CI/CD “robot” can push data to prod) if your team is comfortable with GitFlow and Pull Requests.
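
For example, with a NAS the split could look like this (directory names are placeholders; “myprod” matches the remote name used in the workflow below):

dvc remote add -d nasdev /mnt/nas/dvc-dev   # development remote (default), everyone can write
dvc remote add myprod /mnt/nas/dvc-prod     # production remote, writable only by the release manager or CI robot

dvc push                   # day-to-day work goes to the dev remote
dvc push --remote myprod   # productionization step, restricted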

CI/CD example with GitHub Actions (you can set up the same with GitLab CI/CD and NAS):

name: push-data-to-prod-storage
on:
  push:
    branches:
      - master  # triggers only on merges/pushes to the master branch
jobs:
  push-data-to-prod:
    runs-on: ubuntu-latest
    steps:
      # [SKIP SETUP PART] - checkout, DVC installation, storage credentials, etc.
      - name: update dev and prod storage
        run: |
          # Pull the dataset from dev storage (default) as well as production
          dvc pull   # from development
          dvc pull --remote myprod

          # Do what is needed (even training)
          dvc repro

          # Push data to dev as well as production
          dvc push
          dvc push --remote myprod

Pros of the CI/CD approach:

  • access control to prod data by setting merge rules in GitHub/GitLab (e.g. requiring 2+ approvals, or allowing a merge only after a particular person approves); see the CODEOWNERS sketch after these lists
  • data compliance - GitHub/GitLab gives you a history of who modified the data and when

Cons of the CI/CD approach:

  • the process around “data productionization” and Pull Requests might be a bit heavy for some teams
  • data duplication in dev & prod directories (not solved in DVC yet)
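
To make the merge-rule part concrete: on GitHub you can combine branch protection on master with a CODEOWNERS file, so that any change to .dvc files (and therefore to the data they point to) requires approval from a designated data owner (the user name is a placeholder):

# .github/CODEOWNERS
*.dvc   @data-owner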

@andreassand please let me know if you have any questions.
