DVC integration with Azure ML Pipelines and versioning IoT data

I have two questions.

  1. My data storage is an Azure blob container holding parquet files. These files are the output of preprocessing IoT data, which means data keeps arriving in the storage blobs. How should I version this data with DVC? With every change DVC will create a new version, and I will end up with many, many data versions.
    Please help me understand how people integrate DVC with continuously arriving data.
  2. I am using the Azure ML environment for my ML pipeline. How can I integrate DVC with the Azure ML environment?
    Thanks in advance.

How do you determine when your data stream needs to be versioned? If you can measure and determine this programmatically, you can probably create a process that makes the DVC versions (with the dvc add command) and commits them with Git.
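For example, here is a minimal sketch of such a process, assuming a simple time-based batching rule. Everything here (the 15-day window, the data directory name, the tag format) is a hypothetical choice for illustration, not part of DVC itself:

```python
import subprocess
from datetime import date, timedelta

# Hypothetical batching window: cut a new data version every 15 days.
VERSION_INTERVAL = timedelta(days=15)

def should_version(last_version: date, today: date,
                   interval: timedelta = VERSION_INTERVAL) -> bool:
    """Decide whether enough time has passed to cut a new data version."""
    return today - last_version >= interval

def create_version(data_dir: str, tag: str) -> None:
    """Snapshot the data directory with DVC and record it in Git."""
    subprocess.run(["dvc", "add", data_dir], check=True)
    subprocess.run(["git", "add", f"{data_dir}.dvc"], check=True)
    subprocess.run(["git", "commit", "-m", f"data: {tag}"], check=True)
    subprocess.run(["git", "tag", tag], check=True)

if __name__ == "__main__":
    # Example: last snapshot was on 2021-01-01.
    if should_version(date(2021, 1, 1), date.today()):
        create_version("data", f"data-{date.today().isoformat()}")
```

You could run something like this on a schedule (cron, an Azure Function, etc.) so that versions are only created at the cadence you choose, instead of on every incoming file.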

I’m not familiar with this tool, but from What are machine learning pipelines? - Azure Machine Learning | Microsoft Docs it seems like it’s a graphical UI for creating data workflows. I imagine you could simply install DVC in that environment and use our data versioning commands in DVC, like dvc checkout, dvc pull, dvc push, et al., to manage and access data from those pipeline steps.

DVC also provides a layer of features and CLI commands to create pipelines, such as dvc run, dvc import, and dvc repro. I’m not sure whether these can be integrated with Azure’s platform, or whether you could instead consider replacing that part with DVC, depending on your needs. Please note that DVC also provides another layer of features to manage experiments through hyperparameters (dvc params) and metrics (dvc metrics).
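For instance, a DVC pipeline definition (created by dvc run, or written by hand in dvc.yaml) might look roughly like this — the scripts and file names here are hypothetical placeholders:

```yaml
stages:
  preprocess:
    cmd: python preprocess.py data/raw data/parquet
    deps:
      - preprocess.py
      - data/raw
    outs:
      - data/parquet
  train:
    cmd: python train.py data/parquet model.pkl
    deps:
      - train.py
      - data/parquet
    outs:
      - model.pkl
```

Running dvc repro then re-executes only the stages whose dependencies have changed.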

Please refer to our command reference at https://dvc.org/doc/command-reference for more details on all the commands I’ve mentioned.

I hope this helps answer your questions, but feel free to send any further questions. Thanks!

Thanks for replying.
Let me explain my scenario more clearly.

My data sits in an Azure storage blob container that is continuously growing. I have another VM where I have installed Git and DVC. I initialized Git and DVC and tried to commit, but I am getting an error that the repository does not exist.
Steps I did:
git init
dvc init
dvc remote add --local azuremlblob azure://<container_name>/<folder_name>
(<folder_name> is the data folder I want to track.)
dvc remote modify --local azuremlblob connection_string "connection string"
dvc push
git add
git commit
and I get the error:
fatal: 'azuremlblob' does not appear to be a git repository
fatal: Could not read from remote repository.
which makes sense, since there is no Git repository on Azure named azuremlblob.
What should I do in this case?

Hi @writetoneeraj !

So git commit gives you that error? If so, I suppose you have accidentally added azuremlblob as a Git remote. Could you show your .git/config?


repositoryformatversion = 0
filemode = true
bare = false
logallrefupdates = true

Yeah, that looks alright. Maybe you’ve accidentally configured it at a different level. Could you show the output of git remote, please?

The output is blank; there is no output when I run git remote.

Ok, I got why it is showing blank. Let me fix this first; I will get back to you.

dvc remote add --local azuremlblob azure://<container_name>/<folder_name>
(<folder_name> is the data folder I want to track.)

By the way, maybe there is an additional misunderstanding here: DVC can’t track Azure folders directly. You need to add them locally.

Have you tried our get-started guide, btw? It explains the basic workflow nicely.

Yes, I have gone through the get-started guide. I understand that I have to add the files locally.
I have a doubt about how to add a blob in DVC. I get data, convert it to parquet format, and store it in an Azure storage blob. Everything I have seen so far in the docs adds some local file or folder of files. I tried with remote add as mentioned above. When I add a file, the remote storage gets something like ‘/aa/6804a8d04fec7fb88fd9c1f2f17368’, which looks like metadata. My question is: instead of adding some local file, I want to add a file that is stored as a blob in the storage account.

So you would like to track blobs placed directly on Azure blob storage, and not locally?

Yes, data is processed and stored directly in blobs. We read data from these blobs, so the blobs need to be versioned. Here, every parquet file is a blob that represents one day of data for a particular site, and we keep getting data for different sites daily. I want to group, say, 15 or 30 days of data for all sites and version it.

@writetoneeraj That’s not metadata; that’s how DVC stores data on the remote to provide deduplication. You should use your Git repo to access files on your remote, e.g. by dvc pulling them in, or by using commands like dvc get/import (see https://dvc.org/doc/use-cases/data-registries) to access them from outside the repo.
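To illustrate what that path is: it is derived from the hash of the file’s content, with the first two hex characters used as a subdirectory to keep the cache balanced. A simplified sketch of the idea (the real DVC cache has more details, e.g. directory entries get their own format):

```python
import hashlib

def dvc_cache_path(data: bytes) -> str:
    """Return the cache-relative path DVC would use for this content.

    The MD5 digest of the content becomes the path: the first two hex
    characters are a subdirectory, the remaining 30 are the file name.
    """
    digest = hashlib.md5(data).hexdigest()
    return f"{digest[:2]}/{digest[2:]}"

print(dvc_cache_path(b"hello"))  # -> 5d/41402abc4b2a76b9719d911017c592
```

Because the path is a function of the content, two identical files are stored only once on the remote, which is where the deduplication comes from.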