DVC integration with Azure ML Pipelines and versioning IoT data

I have two questions.

  1. My data storage is Azure blob storage containing parquet files. These parquet files are the output of preprocessing IoT data, which means data keeps arriving in the storage blobs. How do I version the data with DVC? With every change DVC will create a new version, and I will end up with many, many data versions.
    Please help me understand how people integrate DVC with continuously arriving data.
  2. I am using the Azure ML environment for my ML pipeline. How can I integrate DVC with the Azure ML environment?
    Thanks in advance.

How do you determine when your data stream needs to be versioned? If you can measure and determine this programmatically, you could probably create a process that makes the DVC versions (with the dvc add command) and commits them with Git.
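For example, here is a minimal sketch of such a process, assuming the threshold check has already happened and the parquet files have been synced to a local data/ folder (the folder name, schedule, and tag format are all placeholders):

# hypothetical snapshot script, run periodically by a scheduler (e.g. cron)
dvc add data                             # record the current state of the data folder
git add data.dvc .gitignore              # stage the small metafiles DVC created
git commit -m "Data snapshot $(date +%Y-%m-%d)"
git tag "data-$(date +%Y-%m-%d)"         # optional: tag the snapshot for easy lookup later
dvc push                                 # upload the new version to your DVC remote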

I’m not familiar with this tool, but from https://docs.microsoft.com/en-us/azure/machine-learning/concept-ml-pipelines it seems like it’s a graphical UI for creating data workflows. I imagine you could simply install DVC in that environment and use DVC’s data versioning commands like dvc checkout, dvc pull, dvc push, et al. to manage and access data from those pipeline steps.
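Something along these lines, purely as a sketch (the repository URL is a placeholder, and I’m assuming the environment allows pip installs and has Git access):

# inside an Azure ML compute / pipeline step environment
pip install "dvc[azure]"                          # DVC with Azure Blob Storage support
git clone https://github.com/<org>/<repo>.git     # the Git repo holding the .dvc metafiles
cd <repo>
dvc pull                                          # fetch the versioned data for this step
# ... run your training / processing here ...
dvc push                                          # upload any new DVC-tracked outputs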

DVC also provides a layer of features and CLI commands to create pipelines, such as dvc run, dvc import, and dvc repro. I’m not sure whether these can be integrated with Azure’s platform, or whether you could instead replace that part with DVC, depending on your needs. Please note that DVC also provides another layer of features to manage experiments through hyperparameters (dvc params) and metrics (dvc metrics).
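As a rough illustration only (train.py, model.pkl, metrics.json, and the learning_rate parameter are hypothetical names):

# define a reproducible pipeline stage
dvc run -n train \
        -d data/ -d train.py \
        -p train.learning_rate \
        -o model.pkl \
        -M metrics.json \
        python train.py
dvc repro           # re-run the stage whenever data, code, or params change
dvc metrics show    # inspect the tracked metrics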


Please refer to our Command Reference for more details on all the commands I’ve mentioned.

I hope this helps answer your questions, but feel free to follow up with any further doubts. Thanks

Thanks for replying.
Let me explain my scenario more clearly.


My data sits in Azure blob storage and is continuously growing. I have another VM where I have installed Git and DVC. I initialized Git and DVC and tried to commit, but I am getting an error that the repository does not exist.
Steps I did:
git init
dvc init
dvc remote add --local azuremlblob azure://<container_name>/<folder_name>
<folder_name> is the data folder I want to track.
dvc remote modify --local azuremlblob connection_string "connection string"
dvc push
git add
git commit
and I get the error:
fatal: 'azuremlblob' does not appear to be a git repository
fatal: Could not read from remote repository.
which makes sense, since there is no Git repository on Azure named azuremlblob.
In this case, what should I do?

Hi @writetoneeraj !

So git commit gives you that error? If so, I suppose you have accidentally added azuremlblob as a git remote. Could you show your .git/config?

.git/config

[core]
	repositoryformatversion = 0
	filemode = true
	bare = false
	logallrefupdates = true

Yeah, that looks alright. Maybe you’ve accidentally configured it at a different level. Could you show the output of git remote, please?

The output is blank; there is no output when I run git remote.

OK, I got why it is showing blank. Let me fix this first, and I will get back.

dvc remote add --local azuremlblob azure://<container_name>/<folder_name>
<folder_name> is the data folder I want to track.

btw, maybe there is an additional misunderstanding here: DVC can’t track Azure folders directly. You need to add them locally.
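To make that concrete, a rough sketch of the local workflow could look like the following (the container, folder, and path names are placeholders, and az storage blob download-batch is just one possible way to bring the blobs down to your VM):

# 1. fetch the blobs from Azure to a local working copy
az storage blob download-batch -s <container_name> --pattern "<folder_name>/*" -d data

# 2. track the local copy with DVC and version it with Git
dvc add data
git add data.dvc .gitignore
git commit -m "Track IoT parquet data"

# 3. point the DVC remote at an Azure container (ideally separate from the raw data)
#    and upload the cached version there
dvc remote add -d azuremlblob azure://<dvc_storage_container>/<path>
dvc remote modify --local azuremlblob connection_string "<connection string>"
dvc push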

Have you tried our Get Started guide, btw? It explains the basic workflow nicely.

Yes, I have gone through the Get Started guide. I understand that I have to add the files locally.
I have a doubt about how to add a blob with DVC. I get data, convert it to parquet format, and store it in an Azure storage blob. Whatever I have seen so far in the docs is about adding some local file or folder of files. I tried with remote add as mentioned above. When I add some file to the remote storage it adds something like ‘/aa/6804a8d04fec7fb88fd9c1f2f17368’, which looks like metadata. Now my question is: instead of adding some local file, I want to add the file which is stored as a blob in the storage account.

@writetoneeraj
So you would like to track blobs placed directly on Azure blob storage, and not locally?

Yes, the data is processed and stored directly in blobs. We read data from these blobs, so the blobs need to be versioned. Here every parquet file is a blob which represents one day of data for a particular site, and we keep getting data for different sites daily. I want to club, say, 15 or 30 days of data for all sites and version it.

@writetoneeraj That’s not metadata; that’s how DVC stores data on the remote to provide deduplication. You should use your Git repo to access the files on your remote, e.g. by running dvc pull in it, or by using commands like dvc get/dvc import (see https://dvc.org/doc/use-cases/data-registries) to access it from the outside.
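For example (a sketch; the repository URL, the tracked path data, and the revision are placeholders):

# download a versioned copy of the data without cloning the whole repo
dvc get https://github.com/<org>/<repo> data --rev <tag-or-commit>

# or import it into another DVC project, keeping a link back to the source version
dvc import https://github.com/<org>/<repo> data --rev <tag-or-commit>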