I couldn’t find any DVC documentation about versioning batch data or live streaming data.
Is it possible in DVC to track streaming data, and also to fetch data in batches or time-travel across data versions?
Hi @veeresh,
DVC could be used to maintain snapshots of a growing dataset every X amount of time (probably not in real time for a raw stream, though), IF the dataset grows by adding/removing files. That would work efficiently because DVC de-duplicates data storage at the file level (see https://dvc.org/doc/user-guide/large-dataset-optimization). So in that narrow case, it’s technically possible.
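For example, a periodic snapshot workflow could look roughly like this. This is just a sketch, not an official recipe: the `data/stream` path, the tag naming, and the schedule (e.g. a daily cron job) are all placeholders you’d adapt to your setup.

```shell
# Sketch: snapshot a growing dataset once a day (paths/tags are placeholders).
cd my-project              # an existing Git + DVC repo (git init && dvc init)

dvc add data/stream        # re-hash the directory; only new/changed files are stored
git add data/stream.dvc data/.gitignore
git commit -m "Snapshot data/stream for $(date +%F)"
git tag "data-$(date +%F)" # optional: a tag makes the snapshot easy to find later

dvc push                   # upload only the new file chunks to remote storage
```

Because of the file-level de-duplication mentioned above, each `dvc add`/`dvc push` only stores and uploads the files that changed since the last snapshot, not the whole dataset again.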
That said, there may be better tools for timed data backups. I would also question whether snapshots of a data stream should be considered versions: you could say they’re all parts of the same version, which you just haven’t received completely yet (and maybe never will). If the stream includes updates/corrections to previously received data points, though, then that would more clearly represent versioning IMO.
This is all pretty conceptual though. Feel free to share a more specific use case you have in mind for a more concrete answer.
Hi @jorgeorpinel ,
Thanks for explaining.
I was looking into whether something similar to Delta Lake time travel is possible in DVC (see “Introducing Delta Time Travel for Large Scale Data Lakes” on the Databricks Blog).
Let’s say there are 2 folders which keep updating every day (streaming):
- dogs
- cats

Basically, I want to get the data from some time period (between 2 dates). This is not possible in DVC, right?
DVC uses Git as the underlying versioning layer. In Git, you decide when and what to include in your commits, and you can only switch between the commits that you have registered yourself.
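So the closest thing to date-based time travel would be to find the commit that was current at a given date and restore the data from it. A rough sketch, assuming you had been committing snapshots regularly (the `data/stream` path, branch name, and dates are placeholders):

```shell
# Sketch: restore the dataset as it was at a given date
# (assumes snapshot commits exist; path/branch/dates are placeholders).
REV=$(git rev-list -1 --before="2021-06-01" main)  # last commit before that date
git checkout "$REV" -- data/stream.dvc             # restore the .dvc pointer file
dvc checkout data/stream                           # materialize the matching data

# Or list the snapshot commits made between two dates:
git log --since="2021-05-01" --until="2021-06-01" --oneline -- data/stream.dvc
```

Note this gives you the dataset *as of* a date, not “only the records added between two dates” — extracting a delta between two snapshots would be up to you (e.g. diffing the two directory listings).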
The new Experiments feature does include some automatic tracking of multiple project versions, but again, that has a different purpose (ML experiment management).
Thanks