A sample data registry handled with GitHub CI

josecelano · November 30, 2021, 12:57pm

Hi,

I really like the idea of having Data Registries to maintain and re-use datasets. I’m working on an open-source project where we want to apply this concept, though not necessarily for machine learning. We have a collection of images that we want to use in different projects. The images are Chinese radicals and drawings related to those radicals. We call our “Data Registry” a “Media Library”.

To find out whether DVC could fit our requirements, we have developed a proof of concept. The idea is to have a library that holds the DVC pointers and a website that uses the images from the library. In the future, we could have other projects using the images, including ML ones.

We want to handle the library with pull requests. Media contributors can add or delete images. Some repository events trigger workflows that perform tasks such as generating different size/format versions of the images, interacting with the DVC storage, etc.

You can read a full explanation of the CI/CD processes here:

https://github.com/Nautilus-Cyberneering/chinese-ideographs/blob/main/documentation/Processes_and_Workflows.md

# About this document

Using the Chinese ideograph services and features involves several interdependent processes, triggered both by humans and by automation systems.

In this document we will review a typical end-to-end use case, the addition of a "gold" image to the repository: the actions that users must perform and the actions those trigger.

## Terminology

Please review this document to know more about the different kinds of images that can be present in the repository or its related entities (like a website that consumes it).

## Step 1: Manual image upload

The first step is adding the actual gold image file to the repository. This is detailed in this document, but the summary of it is:

1.1 - Add the image to the local DVC repository using the `dvc add` command. This generates two new files: a .dvc file (pointer) and a .gitignore file.
1.2 - Commit both files (.dvc, .gitignore) with Git to the local copy of this repository, in a new branch (we will call it the "image branch").
1.3 - Push the image file to the remote DVC storage with the `dvc push` command.
1.4 - Push the commit created in 1.2 to the remote branch using the `git push` command.
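
The four steps above can be sketched as an ordered list of commands. This is only a minimal illustration, not the project's actual tooling: the function name `gold_image_commands`, the image path, the branch name, the commit message and the `origin` remote are all assumptions made for the example. The script only prints the commands (a dry run), so it can be tried without a DVC repository.

```python
from pathlib import PurePosixPath


def gold_image_commands(image_path, branch, remote="origin"):
    """Return the Step 1 commands (1.1 to 1.4) as argv lists, in order.

    Hypothetical helper: the names and defaults are assumptions, not part
    of the original document.
    """
    image = PurePosixPath(image_path)
    pointer = f"{image}.dvc"                      # pointer file written by `dvc add`
    gitignore = str(image.parent / ".gitignore")  # created or updated by `dvc add`
    return [
        ["dvc", "add", str(image)],                               # 1.1
        ["git", "checkout", "-b", branch],                        # 1.2: new "image branch"
        ["git", "add", pointer, gitignore],                       # 1.2: stage both files
        ["git", "commit", "-m", f"Add gold image {image.name}"],  # 1.2: commit them
        ["dvc", "push"],                                          # 1.3
        ["git", "push", "--set-upstream", remote, branch],        # 1.4
    ]


if __name__ == "__main__":
    import shlex

    # Hypothetical path and branch name, for illustration only.
    for argv in gold_image_commands("data/gold/000001.tif", "add-gold-000001"):
        print(shlex.join(argv))
```

To actually execute the steps, each list could be passed to `subprocess.run(argv, check=True)` from inside a Git repository where DVC is initialised and a DVC remote is configured.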

## Step 2: Gold image processing workflow

This file has been truncated.

I would like to hear about more use cases like this, where DVC is used primarily to implement a versioning system for image collections in a fully automated way.

Paffciu · December 8, 2021, 1:09pm

Hi @josecelano. Sorry for the late reply. I am not aware of any such projects, but I will check whether anyone on the team knows of similar use cases.

Paffciu · December 9, 2021, 7:37am

For the record: We have created a channel in our Discord chat that will be solely dedicated to sharing different use cases: Discord
