Hi!
I wanted to get some advice on best practices for using DVC in the following use case:
- Data is stored in S3 buckets and updated regularly. Existing entries are never modified; new entries are only appended.
- Every time the data is updated, we will run some analysis and train a model on it. This should also happen in the cloud.
- Data size will be on the order of hundreds of GBs. Will this cause problems with caching and data duplication? Should we use reflinks in this case?
- We intend to use DVC to track the data version used for each experiment (analysis and models), with the option to roll back to previous versions.
- Nothing will be executed locally.
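For reference, here is a rough sketch of the flow we have in mind, pieced together from the DVC docs. The bucket name, paths, stage names, and `<rev>` are placeholders, not our actual configuration, so please correct anything that doesn't match recommended practice:

```
# Point DVC at an S3 remote for cache/data storage (bucket/path are placeholders)
dvc remote add -d storage s3://our-bucket/dvc-store

# Prefer reflinks where the filesystem supports them, falling back to copies
dvc config cache.type reflink,copy

# Track the dataset, run the pipeline defined in dvc.yaml, and upload results
dvc add data
dvc repro
dvc push

# Roll back: restore the DVC files from an older Git revision,
# then sync the workspace to that data version
git checkout <rev> -- data.dvc dvc.lock
dvc checkout
```

And a minimal `dvc.yaml` for the analysis + training stages (script names are illustrative):

```
stages:
  analyze:
    cmd: python analyze.py data
    deps:
      - data
    outs:
      - reports/analysis.json
  train:
    cmd: python train.py data
    deps:
      - data
      - reports/analysis.json
    outs:
      - models/model.pkl
```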
What would be the recommended DVC flow for this pipeline?
Please let me know if you’d like any clarification.
Thanks!