Using DVC outside git

Hello,
I am grateful for what DVC offers so far in data management. I am working on a labeling tool for image segmentation, and the image labels are stored in cloud storage after they are generated. My question: is it possible to manage this dataset with DVC, given that the data folder is not a git/DVC working directory, so that I can dvc pull it into any project I wish to work on? A new segmentation is added with every new labeling.

Hi, are you looking for a data registry, from which you can dvc import the data into any project you want?

I checked this out, but I don't think it will solve my issue. To clarify: the raw images are originally stored in AWS S3. The labeling tool accesses the image URLs and embeds the images in the UI, where one labels them and generates the segmentations. At this point there is no DVC repo yet, so how can the data be managed with DVC when no commands such as dvc add or dvc push are run locally?

As I understand it, a data registry will only work if I start by creating a git repo for the data locally and pushing the data to the cloud with dvc push. The way DVC stores data in cloud storage will also make it hard for the labeling tool, which is a JS-based app, to read it.
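To illustrate the storage-layout concern: a DVC remote is content-addressed, so objects are stored under a key derived from the file's MD5 hash rather than under the original file name. Below is a minimal sketch of that idea, assuming the `files/md5/<first two hex chars>/<rest>` layout used by recent DVC releases (3.x); older releases omit the `files/md5` prefix. The function name is hypothetical and this is not DVC's own code.

```python
import hashlib


def dvc_style_remote_key(content: bytes, prefix: str = "files/md5") -> str:
    """Return a content-addressed key similar to what a DVC remote uses.

    Assumption: recent DVC releases lay objects out as
    <prefix>/<first 2 hex chars of md5>/<remaining 30 hex chars>.
    """
    digest = hashlib.md5(content).hexdigest()
    return f"{prefix}/{digest[:2]}/{digest[2:]}"


if __name__ == "__main__":
    # The key depends only on the bytes, not on the file name.
    print(dvc_style_remote_key(b"hello"))
```

Because the key carries no file name, an app that expects to fetch an image by its original path cannot browse a DVC remote directly without the matching .dvc metadata.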

OK, thanks for the clarification. In my opinion, if you want to use DVC to track data versions, you will still need a DVC + git repo to store the DVC version files. Even advanced features like external data, or workarounds such as using DVC's Repo API (Repo.add, Repo.push, Repo.pull), need a local repo to operate on.
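For reference, the local setup this implies can be sketched as a short command sequence. The snippet below only builds the commands (using the standard git and dvc CLI subcommands) and runs them on request; the directory name, remote name, and bucket URL are placeholders, not values from this thread.

```python
import subprocess


def dvc_tracking_commands(data_dir, remote_url, remote_name="storage"):
    """Build the git/dvc commands for tracking data_dir in a fresh repo.

    data_dir, remote_url and remote_name are illustrative placeholders.
    """
    return [
        ["git", "init"],
        ["dvc", "init"],
        ["dvc", "remote", "add", "-d", remote_name, remote_url],
        ["dvc", "add", data_dir],  # writes <data_dir>.dvc and updates .gitignore
        ["git", "add", f"{data_dir}.dvc", ".gitignore", ".dvc/config"],
        ["git", "commit", "-m", f"Track {data_dir} with DVC"],
        ["dvc", "push"],  # uploads the data to the default remote
    ]


def run_all(commands, cwd="."):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, cwd=cwd, check=True)


if __name__ == "__main__":
    for cmd in dvc_tracking_commands("labels", "s3://example-bucket/dvc-store"):
        print(" ".join(cmd))
```

The small .dvc file committed to git is what later lets another project locate the data; the data itself only lives in the remote.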

We are working on a new tool in the DVC ecosystem that sounds like a better solution.

Do you mind describing your workflow in greater detail?

ok sure, let me drop a diagram here

The final storage is the DVC remote storage, which can be exported to any project.

What do you use the CSV for?

It contains the points for the labels (e.g. ellipses, circles, and polygons), which are used to generate the image segmentations.
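As a rough illustration of how label points in a CSV can become a segmentation mask, here is a minimal sketch for the polygon case only. The column names (`image`, `shape`, `points`) and the `x y;x y;...` point encoding are assumptions made for the example; the actual CSV layout of the labeling tool is not shown in this thread.

```python
import csv
import io

# Hypothetical CSV layout: one row per label, points encoded as "x y;x y;...".
SAMPLE_CSV = "image,shape,points\nimg_001.png,polygon,1 1;4 1;4 4;1 4\n"


def parse_points(field):
    """Parse 'x1 y1;x2 y2;...' into a list of (x, y) float tuples."""
    return [tuple(float(v) for v in pair.split()) for pair in field.split(";")]


def point_in_polygon(px, py, vertices):
    """Even-odd ray casting test: is (px, py) inside the polygon?"""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        if (y1 > py) != (y2 > py):  # edge crosses the horizontal line through py
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside


def polygon_mask(vertices, width, height):
    """Binary mask (list of rows), sampling each pixel at its center."""
    return [
        [1 if point_in_polygon(x + 0.5, y + 0.5, vertices) else 0 for x in range(width)]
        for y in range(height)
    ]


def masks_from_csv(text, width, height):
    """Build one mask per image from the polygon rows of the CSV text."""
    masks = {}
    for row in csv.DictReader(io.StringIO(text)):
        if row["shape"] == "polygon":
            masks[row["image"]] = polygon_mask(parse_points(row["points"]), width, height)
    return masks
```

Since masks can be regenerated from the points, versioning the CSV together with pointers to the images may be enough; the masks themselves are derived data.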

Just to clarify: your images and labels live in S3, and you consider this location to be immutable (in the sense that you can guarantee they will not be accidentally deleted or moved as you keep adding new labels)?

If this is the case, we can provide tracking of label versions and datasets (defined as collections of pointers) with our new tool.
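To make "a dataset as a collection of pointers" concrete, here is a minimal sketch of the general idea. It is not the format or API of the tool mentioned above, which is not described in this thread; the function name and the example S3 URIs are hypothetical. It relies on the immutability assumption: a pointer is only meaningful if the object behind it never moves or changes.

```python
import hashlib
import json


def dataset_version(pointers):
    """Derive a version id from a collection of storage pointers (URIs).

    The id depends only on the set of pointers, not on their order,
    so adding or removing a pointer yields a new version id.
    """
    canonical = json.dumps(sorted(set(pointers)), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    v1 = dataset_version([
        "s3://example-bucket/images/img_001.png",
        "s3://example-bucket/labels/img_001.csv",
    ])
    v2 = dataset_version([
        "s3://example-bucket/images/img_001.png",
        "s3://example-bucket/labels/img_001.csv",
        "s3://example-bucket/labels/img_002.csv",
    ])
    print(v1 != v2)  # a new label produces a new dataset version
```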

This sounds nice. Please update me once you have it ready or in beta.