Ensuring dataset quality when maintaining a data registry

In general terms, I would like to know how you ensure dataset quality in a registry. Maybe I am taking the wrong approach, and I would like to know whether anyone else has a solution for this.

This is in the context of setting up a production, enterprise-grade data registry.

A consistent DevOps approach I’ve used before when setting up repos and pipelines is to ensure that there are checks before anything is committed to the main branch, usually by setting up a CI/CD solution with one or more pipelines (depending on the type of project/repo).

Generally a team contributes to a repo by cloning it, creating a feature branch, making changes and pushing to the remote. At that point a PR can be created, which triggers a pipeline that runs tests and checks, and normally a manual approval is required before the code can be merged.
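For reference, a rough sketch of that flow (the repo URL and branch name are just placeholders):

```bash
# Standard contributor flow: work happens on a feature branch, never directly on main
git clone git@github.com:example-org/example-repo.git   # placeholder repo
cd example-repo
git checkout -b feature/my-change
# ... make changes ...
git add .
git commit -m "Describe the change"
git push -u origin feature/my-change
# Opening a PR from feature/my-change -> main triggers the CI pipeline,
# which runs the tests/checks and waits for a manual approval before merge.
```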

The same thinking should apply to a dataset registry. In this paradigm of using git to track datasets, I would expect any contributor to go through a similar process of checks and approvals before being able to contribute.

The challenge is that once a remote is set up, any dev could clone a repo used as a dataset registry (with DVC initialised) and, before anyone has checked their work, run dvc add, commit their .dvc files and dvc push (assuming the given dev has access keys).
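Concretely, the unguarded path looks something like this (repo URL and dataset path are made up):

```bash
# Nothing stops a dev with remote credentials from doing this on any branch:
git clone git@github.com:example-org/dataset-registry.git   # placeholder repo
cd dataset-registry
dvc add data/new_dataset.csv       # creates data/new_dataset.csv.dvc
git add data/new_dataset.csv.dvc data/.gitignore
git commit -m "Add new dataset"
dvc push                           # data lands in the remote bucket immediately,
                                   # before any PR review or pipeline check has run
```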

My initial thought was to set up two separate buckets in my remote, a staging bucket and a production bucket, and in my DVC config set up two remotes to match the buckets.
All devs can contribute their datasets to the staging bucket without going through any approval pipeline. They could dvc add, git add and commit, and dvc push in their feature branch.
Once they push code to the git remote and create a PR, a pipeline could run (including a manual approval), and the pipeline runner, set up with the production bucket keys, would run a dvc pull from the staging bucket and a dvc push into the prod bucket, as sketched below.
This has its own challenges, but in general terms I think that explains my idea.
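A minimal sketch of what I mean, assuming hypothetical bucket names my-registry-staging and my-registry-prod:

```bash
# One-time repo config: two DVC remotes pointing at the two buckets
dvc remote add -d staging s3://my-registry-staging   # default remote for contributors
dvc remote add prod s3://my-registry-prod

# Contributor flow on a feature branch: data only reaches the staging bucket
dvc add data/new_dataset.csv
git add data/new_dataset.csv.dvc data/.gitignore
git commit -m "Add new dataset"
dvc push -r staging

# Pipeline runner, after the PR is approved and using the prod bucket keys:
# promote exactly the data referenced by the .dvc files in the PR
dvc pull -r staging
dvc push -r prod
```

The obvious cost is that the runner has to download everything from staging and re-upload it to prod, which is where the load concern below comes from.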

Another approach is to use an S3 sync operation between the staging and prod buckets, and update the .dvc file using dvc pull -r staging --dry. This approach would save a massive amount of load on the runner.
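The sync part would be something like this (same made-up bucket names as before); S3 performs the object copies server-side, which is where the saving on the runner comes from:

```bash
# Promote objects from staging to prod without routing the bytes through the runner:
# aws s3 sync copies directly between buckets, so the runner never downloads/re-uploads the data.
aws s3 sync s3://my-registry-staging s3://my-registry-prod
```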
How do you manage a dataset registry in a DevOps-consistent approach?
Has anyone done anything similar before?
Or do you just create a repo, and trust anyone with access keys to contribute without any checks?
I am struggling to see how that would work at an enterprise level.