Hi!
I wanted to get some advice on best practices for using DVC in the following use case:
- Data is stored in S3 buckets and updated regularly. Existing entries are never modified; new entries are only appended.
- Every time the data is updated, we will run some analysis and train a model on it. This should also happen in the cloud.
- Data size will be on the order of hundreds of GBs. Will this cause problems with caching and data duplication? Should we use reflinks in this case?
- We intend to use DVC to track the data version used for each experiment (analysis and models), with the option to roll back to previous versions.
- Nothing will be executed locally.
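For reference, here is a rough sketch of the flow we have in mind, pieced together from the DVC docs. The bucket name, paths, stage names, and `<rev>` are placeholders, not our actual configuration, so please correct anything that doesn't match recommended practice:

```
# Point DVC at an S3 remote for cache/data storage (bucket/path are placeholders)
dvc remote add -d storage s3://our-bucket/dvc-store

# Prefer reflinks where the filesystem supports them, falling back to copies
dvc config cache.type reflink,copy

# Track the dataset, run the pipeline defined in dvc.yaml, and upload results
dvc add data
dvc repro
dvc push

# Roll back: restore the DVC files from an older Git revision,
# then sync the workspace to that data version
git checkout <rev> -- data.dvc dvc.lock
dvc checkout
```

And a minimal `dvc.yaml` for the analysis + training stages (script names are illustrative):

```
stages:
  analyze:
    cmd: python analyze.py data
    deps:
      - data
    outs:
      - reports/analysis.json
  train:
    cmd: python train.py data
    deps:
      - data
      - reports/analysis.json
    outs:
      - models/model.pkl
```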
What would be the recommended DVC flow for this pipeline?
Please let me know if you’d like any clarification.
Thanks!