Best practices for data stored in the cloud?

I would like some advice on best practices for using DVC in the following use case:

  • Data will be stored in S3 buckets and updated regularly. Existing data is never modified; new entries are only appended.
  • Every time the data is updated, we will run some analysis and train a model on it. This should also be done in the cloud.
  • Data size will be on the order of hundreds of GB. Will this cause problems with caching and data duplication? Should reflinks be used in this case?
  • We intend to use DVC to track the data version used for each experiment (analysis and models) and to have the option to roll back to previous versions.
  • Nothing will be executed locally.
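
To make the use case concrete, here is a rough sketch of one possible shape for this flow, written as a small Python script that only prints the commands it would run. This is just my guess, not a recommended setup: the bucket URLs, the remote name `storage`, and the output name `raw` are hypothetical, and it assumes a Git repository with a `dvc.yaml` pipeline already exists on the cloud machine.

```python
import shlex
from typing import List

# Hypothetical locations, for illustration only.
DATA_URL = "s3://example-bucket/raw"          # bucket prefix where new entries land
REMOTE_URL = "s3://example-bucket/dvc-store"  # DVC remote used for versioned copies


def setup_commands() -> List[List[str]]:
    """One-time setup inside an existing Git repository."""
    return [
        ["dvc", "init"],
        ["dvc", "remote", "add", "-d", "storage", REMOTE_URL],
        # Prefer links over copies so the workspace does not duplicate the cache.
        ["dvc", "config", "cache.type", "reflink,symlink,hardlink,copy"],
        # Track the S3 data as an external import; this writes raw.dvc.
        ["dvc", "import-url", DATA_URL, "raw"],
    ]


def update_commands() -> List[List[str]]:
    """Run each time new entries are added to the bucket."""
    return [
        ["dvc", "update", "raw.dvc"],       # pick up the new data version
        ["dvc", "repro"],                   # rerun analysis and training (needs dvc.yaml)
        ["git", "add", "raw.dvc", "dvc.lock"],
        ["git", "commit", "-m", "Update data and retrain"],
        ["dvc", "push"],                    # upload cached data to the remote
    ]


def rollback_commands(rev: str) -> List[List[str]]:
    """Restore the data version recorded at a given Git revision."""
    return [
        ["git", "checkout", rev],
        ["dvc", "checkout"],
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    plan = setup_commands() + update_commands() + rollback_commands("HEAD~1")
    for cmd in plan:
        print(shlex.join(cmd))
```

The script deliberately does not execute anything, since I am not sure whether `dvc import-url` plus `dvc update` is the right way to handle a bucket that keeps growing, or whether link-based cache types help at all when everything lives on S3.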

What would be the recommended DVC flow for this pipeline?
Please let me know if you need any clarification.