Hey guys, quick question.
Following ML workflow best practices, we must store our raw data in a single data lake structure, right?
DVC supports Google Cloud Storage, S3, and many other storage backends, so we can keep our code, datasets, and models in one common place.
Is it correct to say that I have a “data lake” structure just by using DVC?
I’m using DVC to store my datasets and models, and now I’d like to know whether it’s necessary to implement a data lake architecture on top of that.
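For reference, my current setup just points DVC at a cloud bucket as its default remote. It looks roughly like this in `.dvc/config` (the bucket name here is only a placeholder, not my real one):

```ini
[core]
    remote = storage

['remote "storage"']
    ; any supported backend URL works here: s3://, gs://, azure://, ssh://, etc.
    url = s3://my-bucket/dvcstore
```

Then `dvc push` uploads the tracked datasets and models to that bucket, which is what made me wonder if this already counts as a “data lake”.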
I don’t think we can draw such a strong connection between DVC and the concept of data lakes. DVC is pretty much agnostic to architecture, language, framework, design patterns, etc.! If anything, DVC helps you organize data and its processes (with stages and pipelines), which doesn’t sound much like a data lake to me.
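To illustrate what I mean by stages and pipelines: DVC lets you describe how data is processed in a `dvc.yaml` file, something a data lake by itself doesn’t do. A minimal sketch (the script and directory names here are just hypothetical examples):

```yaml
stages:
  prepare:
    # hypothetical script that cleans the raw data
    cmd: python prepare.py data/raw data/prepared
    deps:
      - prepare.py
      - data/raw
    outs:
      - data/prepared
  train:
    # hypothetical training step that produces a model artifact
    cmd: python train.py data/prepared model.pkl
    deps:
      - train.py
      - data/prepared
    outs:
      - model.pkl
```

So DVC is about versioning data and reproducing the process that transforms it; where the bytes physically live (S3, GCS, a data lake, whatever) is a separate decision.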
But maybe I’m not understanding your question. Can you provide some definitions? What do you mean by a data lake? Can you give an example of a “data lake architecture”?