Let’s assume that we have data in an S3 bucket. The data is downloaded locally and, after processing, a TFRecord file is created from it.
The number of samples in the S3 bucket has since grown, and we don’t want to rerun the pipeline on all of the data. The new data is in the same S3 bucket as the previous version. No versioning is enabled on the bucket.
- How does DVC understand that the data has changed? Do we need to trigger it manually?
- How can the pipeline be run only for the new version of the data?
- How can we tag the data and/or the TFRecord with a version number that is tracked implicitly by the system? Can this tag be read from anywhere?
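To make the second and third questions concrete, here is a minimal sketch of the behaviour I am after, written in plain Python rather than with DVC. It assumes that a listing of the bucket is available as a mapping from object key to ETag. The function names `diff_listing` and `listing_version` and the sample keys are hypothetical and only there for illustration.

```python
import hashlib
import json


def diff_listing(previous, current):
    """Return the keys that are new or whose ETag changed since the previous listing."""
    return sorted(
        key for key, etag in current.items() if previous.get(key) != etag
    )


def listing_version(listing):
    """Derive a short, deterministic version id from a {key: etag} listing."""
    # Sorting the keys makes the id independent of the listing order.
    payload = json.dumps(listing, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


# Hypothetical listings (object key -> ETag) from two points in time.
previous = {"raw/a.jpg": "etag-1", "raw/b.jpg": "etag-2"}
current = {"raw/a.jpg": "etag-1", "raw/b.jpg": "etag-2b", "raw/c.jpg": "etag-3"}

to_process = diff_listing(previous, current)
print(to_process)  # ['raw/b.jpg', 'raw/c.jpg']
print(listing_version(current))  # a 12-character hex id for this state of the bucket
```

Only the objects in `to_process` would need to go through the pipeline again, and the id from `listing_version` could serve as the implicit version tag. The sketch ignores deleted objects. What I would like to know is whether DVC can do the equivalent for me.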
I would appreciate it if you could share your solutions.