Hey guys, quick question.
Following ML workflow best practices, we must store our raw data in a single data lake structure, right?
DVC supports Google Cloud Storage, S3, and many other storage backends, so we can keep our code, datasets, and models in one common place.
Is it correct to say that I have a “data lake” structure just by using DVC?
I’m using DVC to store my datasets and models, and now I’d like to know whether it’s necessary to implement a data lake architecture on top of that.
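For reference, my current setup just points DVC at a cloud bucket as its default remote. It looks roughly like this in `.dvc/config` (the bucket name here is only a placeholder, not my real one):

```ini
[core]
    remote = storage

['remote "storage"']
    ; any supported backend URL works here: s3://, gs://, azure://, ssh://, etc.
    url = s3://my-bucket/dvcstore
```

Then `dvc push` uploads the tracked datasets and models to that bucket, which is what made me wonder if this already counts as a “data lake”.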
I don’t think we can draw such a strong connection between DVC and the concept of data lakes. DVC is pretty much agnostic to architecture, language, framework, design patterns, etc.! If anything, DVC helps you organize data and its processes (with stages and pipelines), which doesn’t sound much like a data lake to me.
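To illustrate what I mean by stages and pipelines: DVC lets you describe how data is processed in a `dvc.yaml` file, something a data lake by itself doesn’t do. A minimal sketch (the script and directory names here are just hypothetical examples):

```yaml
stages:
  prepare:
    # hypothetical script that cleans the raw data
    cmd: python prepare.py data/raw data/prepared
    deps:
      - prepare.py
      - data/raw
    outs:
      - data/prepared
  train:
    # hypothetical training step that produces a model artifact
    cmd: python train.py data/prepared model.pkl
    deps:
      - train.py
      - data/prepared
    outs:
      - model.pkl
```

So DVC is about versioning data and reproducing the process that transforms it; where the bytes physically live (S3, GCS, a data lake, whatever) is a separate decision.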
But maybe I’m not understanding your question. Can you provide some definitions? What do you mean by a data lake? Can you give an example of a “data lake architecture”?