Does DVC work for versioning live streaming data and batch data? If yes, can someone explain briefly?

veeresh · April 26, 2021, 5:29am

I couldn’t find any DVC documentation on versioning batch data or live streaming data.
Is it possible in DVC to track streaming data, and also to fetch data in batches or time travel through the data?

jorgeorpinel · April 26, 2021, 5:52am

Hi @veeresh,

DVC could be used to maintain snapshots of a growing dataset at regular intervals (probably not in real time for a raw stream though) IF the dataset grows by adding/removing files. That would work efficiently because DVC de-duplicates data storage at the file level (see https://dvc.org/doc/user-guide/large-dataset-optimization). So in that narrow case, it’s technically possible.
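To illustrate why file-level de-duplication keeps periodic snapshots cheap, here is a toy sketch in Python. It is not DVC's actual implementation: the `SnapshotStore` class, the file names, and the choice of SHA-256 are all assumptions made for illustration. The idea is that each snapshot is only a small manifest of path-to-hash entries, and file contents already in the store are not stored again.

```python
import hashlib


class SnapshotStore:
    """Toy content-addressed store (hypothetical, not DVC's real cache layout)."""

    def __init__(self):
        self.objects = {}    # content hash -> file bytes, each unique content kept once
        self.snapshots = {}  # snapshot name -> {path: content hash}

    def snapshot(self, name, files):
        """Record a snapshot of `files` (path -> bytes).

        Returns how many new objects had to be stored for this snapshot.
        """
        new_objects = 0
        manifest = {}
        for path, content in files.items():
            # SHA-256 is an assumption here; any stable content hash works for the sketch.
            digest = hashlib.sha256(content).hexdigest()
            if digest not in self.objects:
                self.objects[digest] = content
                new_objects += 1
            manifest[path] = digest
        self.snapshots[name] = manifest
        return new_objects


store = SnapshotStore()
day1 = {"data/a.csv": b"first batch", "data/b.csv": b"second batch"}
day2 = {**day1, "data/c.csv": b"third batch"}  # the dataset grew by one file

added_day1 = store.snapshot("2021-04-25", day1)
added_day2 = store.snapshot("2021-04-26", day2)
print(added_day1, added_day2)  # 2 1
```

The second snapshot references all three files but only stores the one that is new. If the stream instead appended rows to a single big file, every snapshot would store a whole new copy, which is why this only works well when growth looks like adding/removing files.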

That said, there may be better tools for timed data back-ups. And I would question whether snapshots of a data stream should be considered versions: You could say they’re all parts of the same version, which you just haven’t obtained completely yet (maybe never will). If the stream includes updates/corrections to previously received data points though, then that would more clearly represent versioning IMO.

This is all pretty conceptual though. Feel free to share a more specific use case you have in mind for a more concrete answer.

veeresh · April 26, 2021, 6:40am

Hi @jorgeorpinel ,

Thanks for explaining.

I was wondering whether something similar to Delta Lake time travel is possible in DVC (Introducing Delta Time Travel for Large Scale Data Lakes - The Databricks Blog).

Let's say there are 2 folders which keep updating every day (streaming):
- dogs
- cats

Basically, I want to get the data for some time period (between 2 dates). This is not possible in DVC, right?

jorgeorpinel · April 26, 2021, 6:44am

DVC uses Git as the underlying versioning layer. In Git you decide when and what to include in your commits. You can only switch between the commits that you have registered yourself.
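As a rough sketch of what that means for date-based access: a date can only be resolved to the nearest commit you registered, and "data between 2 dates" becomes the difference between two such snapshots. The commit history, commit ids, and helper names below are hypothetical, made up for illustration. In a real repository, something like `git rev-list -1 --before=<date> <branch>` followed by `git checkout <commit>` and `dvc checkout` should restore the matching snapshot, and `dvc diff <rev_a> <rev_b>` lists what changed between two revisions.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical commit history, sorted by date:
# (commit date, commit id, files tracked in that snapshot).
HISTORY = [
    (date(2021, 4, 1), "a1b2c3", {"dogs/1.jpg", "cats/1.jpg"}),
    (date(2021, 4, 10), "d4e5f6", {"dogs/1.jpg", "dogs/2.jpg", "cats/1.jpg"}),
    (date(2021, 4, 20), "0718aa", {"dogs/1.jpg", "dogs/2.jpg", "cats/1.jpg", "cats/2.jpg"}),
]


def snapshot_at(history, when):
    """Return the latest (date, commit, files) entry on or before `when`, or None."""
    dates = [entry[0] for entry in history]
    idx = bisect_right(dates, when)
    return history[idx - 1] if idx else None


def files_added_between(history, start, end):
    """Files present in the snapshot at `end` but not in the snapshot at `start`."""
    before = snapshot_at(history, start)
    after = snapshot_at(history, end)
    old_files = before[2] if before else set()
    new_files = after[2] if after else set()
    return sorted(new_files - old_files)


# April 15 has no commit of its own, so it resolves to the April 10 snapshot.
print(snapshot_at(HISTORY, date(2021, 4, 15))[1])  # d4e5f6
print(files_added_between(HISTORY, date(2021, 4, 5), date(2021, 4, 25)))
# ['cats/2.jpg', 'dogs/2.jpg']
```

The granularity is whatever your commit schedule is: if you only commit every 10 days, you can't recover the state of the folders on a day in between.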

The new Experiments features do include some automatic tracking of multiple project versions but again that has a different purpose (ML experiment management).

Thanks
