Using DVC outside git

kongkip · December 20, 2021, 1:04pm

Hello here,
I am grateful for what DVC has offered so far in data management. I am working on a labeling tool for image segmentation, with the image labels stored in cloud storage after they are generated. My question: is it possible to manage this dataset with DVC, given that the data folder is not a Git/DVC working directory, so that I can dvc pull it from any project I wish to work on? New segmentations are added with every labeling session.

YanxiangGao · December 20, 2021, 2:35pm

Hi, are you looking for a data registry, from which you can dvc import the data into any project you want?
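
For reference, the data registry route comes down to running dvc import inside the consuming project. Here is a minimal sketch that only assembles that command. The registry URL and the paths are placeholders and not from this thread; `-o` and `--rev` are the output-path and revision options of dvc import.

```python
import subprocess


def dvc_import_command(registry_url, path_in_registry, out=None, rev=None):
    """Build the argv for `dvc import` from a Git-hosted data registry."""
    cmd = ["dvc", "import", registry_url, path_in_registry]
    if out is not None:
        cmd += ["-o", out]      # where to place the data in this project
    if rev is not None:
        cmd += ["--rev", rev]   # Git tag, branch, or commit of the registry
    return cmd


def run_import(registry_url, path_in_registry, out=None, rev=None):
    # Must be run from inside a DVC+Git project; raises if dvc exits non-zero.
    subprocess.run(
        dvc_import_command(registry_url, path_in_registry, out, rev),
        check=True,
    )
```

Note that this still assumes the registry itself is a Git repo with DVC-tracked data, which is exactly the part in question here.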

kongkip · December 20, 2021, 7:42pm

I checked this out, but I don’t think it will solve my issue. To clarify: the raw images are originally stored in AWS S3. The labeling tool accesses the image URLs, which are embedded as images in the UI, where one labels them and generates the segmentations. At this point this is not a DVC repo yet, so how can the data be managed with DVC when no commands such as dvc add or dvc push are run locally?

kongkip · December 20, 2021, 7:47pm

If I understand correctly, a data registry will work only if I start by creating a Git repo locally for the data and pushing the data to the cloud with dvc push. The way DVC stores data in cloud storage will also make it hard to use from the labeling tool, which is a JS-based app.

YanxiangGao · December 21, 2021, 2:31am

OK, thanks for the clarification. In my opinion, if you want to use DVC to track data versions, you will still need a DVC+Git repo to store your DVC version files. Even advanced features like external data, or hacks such as using DVC’s Repo API (Repo.add, Repo.push, Repo.pull), all need a local repo to operate.
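
To illustrate the Repo API route, here is a rough sketch. It assumes a local DVC+Git repo already exists and has a default remote configured. The function names and paths are made up for the example, and the Repo class is internal to DVC, so its behavior may change between versions.

```python
def snapshot_labels(repo, target):
    """Track `target` with DVC and upload it to the default remote."""
    repo.add(target)  # same effect as `dvc add <target>`
    repo.push()       # same effect as `dvc push`


def snapshot_in_local_repo(repo_path, target):
    # Imported lazily so this module loads even without DVC installed.
    from dvc.repo import Repo

    # Repo(...) fails if repo_path is not inside an initialized DVC repo,
    # which is the limitation described above.
    snapshot_labels(Repo(repo_path), target)
```

The resulting .dvc file would still have to be committed to Git to record the version.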

volkfox · December 21, 2021, 3:45am

We are working on a new tool in the DVC ecosystem that sounds like a better solution.

Do you mind describing your workflow in greater detail?

kongkip · December 21, 2021, 5:59am

ok sure, let me drop a diagram here

The final storage is the dvc remote storage which can be exported to any project

volkfox · December 21, 2021, 7:30am

what do you use csv for?

kongkip · December 21, 2021, 8:59am

It contains the points for the labels (e.g., ellipses, circles, and polygons), which are used to generate the image segmentations.
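
As a rough illustration of that step, here is a sketch that turns such a CSV into a binary mask. The row layout is an assumption made for the example (`circle,cx,cy,r`, `ellipse,cx,cy,rx,ry`, `polygon,x1,y1,x2,y2,...`) and not the tool's actual format. A pixel is filled when its center falls inside a shape.

```python
import csv
import io


def point_in_polygon(x, y, pts):
    """Even-odd ray casting: count edge crossings to the right of (x, y)."""
    inside = False
    for i in range(len(pts)):
        x1, y1 = pts[i]
        x2, y2 = pts[(i + 1) % len(pts)]
        if (y1 > y) != (y2 > y):  # edge spans the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def shape_contains(kind, vals, x, y):
    if kind == "circle":
        cx, cy, r = vals
        return (x - cx) ** 2 + (y - cy) ** 2 <= r ** 2
    if kind == "ellipse":
        cx, cy, rx, ry = vals
        return ((x - cx) / rx) ** 2 + ((y - cy) / ry) ** 2 <= 1
    if kind == "polygon":
        return point_in_polygon(x, y, list(zip(vals[0::2], vals[1::2])))
    raise ValueError(f"unknown shape: {kind}")


def rasterize(csv_text, width, height):
    """Return a height x width mask (lists of 0/1) covering all shapes."""
    mask = [[0] * width for _ in range(height)]
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        kind = row[0].strip().lower()
        vals = [float(v) for v in row[1:]]
        for py in range(height):
            for px in range(width):
                # Test the pixel center, not its corner.
                if shape_contains(kind, vals, px + 0.5, py + 0.5):
                    mask[py][px] = 1
    return mask
```

For example, `rasterize("circle,2.5,2.5,1\n", 5, 5)` marks the center pixel and its four direct neighbors.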

volkfox · December 21, 2021, 4:14pm

Just to clarify: your images and labels live in S3, and you consider this location to be immutable (in the sense that you can guarantee they will not be accidentally deleted or moved as you keep adding new labels)?

If this is the case, we can provide tracking of label versions and datasets (defined as collections of pointers) with our new tool.

kongkip · January 11, 2022, 10:09am

This sounds nice. Please update me once you have it ready or in beta.
