Let’s assume that we have data in an S3 bucket. The data is downloaded locally and, after processing, a TFRecord file of the data is produced.
The number of samples in the S3 bucket has increased, and we don’t want to redo the pipeline for all of the data. The new data is in the same S3 bucket as the previous version; no versioning of the data is done on S3.
How does DVC understand that the data has changed? Do we need to trigger it manually?
How can the pipeline be only run for the new version of the data?
How can we tag the data and/or the TFRecord with a version number that is implicit in the system? Can this tag be read from anywhere?
I would appreciate it if you could share your solutions.
DVC computes checksums, and if they don’t match what is recorded in the .dvc files, DVC concludes that the data has changed.
DVC won’t re-run your pipeline if the data hasn’t changed; this is a core feature of DVC.
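To illustrate the idea behind this detection (not DVC’s actual implementation, just a minimal sketch of the same checksum-comparison principle, using MD5 as DVC does for files):

```python
import hashlib


def md5_of_file(path):
    """Compute the MD5 checksum of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def data_changed(path, recorded_md5):
    """True if the file's current checksum differs from the recorded one,
    i.e. the data has changed since the checksum was recorded."""
    return md5_of_file(path) != recorded_md5
```

So there is nothing to trigger manually: on `dvc repro` (or `dvc status`), the current checksums are compared against the recorded ones, and only a mismatch marks the stage as changed.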
Currently you can either use Git tags for that, or use a naming strategy for your data so that the version is reflected in the filename. We are working on an additional dvc tag feature that will introduce DVC tags, which you will be able to set for particular data files in the respective DVC-files.
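A minimal sketch of the filename-based strategy, assuming a hypothetical convention like `dataset_v2.tfrecord` (the convention itself is up to you; this is not a DVC feature):

```python
import re


def version_from_filename(name):
    """Extract a version number from names like 'dataset_v2.tfrecord'.

    Returns the version as an int, or None when no version marker
    is present in the filename.
    """
    m = re.search(r"_v(\d+)\.", name)
    return int(m.group(1)) if m else None
```

Because the version lives in the filename, it can be read by any part of the system — the pipeline stages, the training code, or a deployment script — without extra tooling.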
If I understand correctly, then what you mean is a transformation of the data?
E.g. some function f(X) = y that needs to be applied to each piece of data separately, so that given a new data point X_{n+1} you only need to compute and save y_{n+1}, and can keep all the previous y’s?
I would suggest using the new `dvc run -d INPUT_DATA_FOLDER --outs-persist OUTPUT_FOLDER`.
Then, when your repro runs, you can check the output folder (which DVC won’t remove prior to repro, thanks to `--outs-persist`) to see which data points already have computed y’s, and only compute the y’s for the new data.
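The skip logic inside the stage could look like this — a minimal sketch, assuming one output file per input file and a hypothetical `transform` function (your actual transformation, file extensions, and naming will differ):

```python
from pathlib import Path


def process_new_only(input_dir, output_dir, transform):
    """Apply `transform` only to inputs that have no output yet.

    Outputs already present in `output_dir` (kept across repro by
    --outs-persist) are skipped, so only new data points are processed.
    Returns the names of the inputs processed on this run.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for src in sorted(input_dir.iterdir()):
        dst = output_dir / (src.stem + ".out")
        if dst.exists():
            continue  # y already computed for this X on a previous run
        dst.write_text(transform(src.read_text()))
        processed.append(src.name)
    return processed
```

On the first repro every input is processed; on the next repro, after new samples land in the input folder, only those new samples are transformed.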
My suggestion above still applies; this will need to be application-level logic, not something DVC handles as of now.
Meaning, the DVC pipeline WILL be re-run, but you can avoid the unnecessary computation by checking which outputs have already been computed.
Since DVC doesn’t know what you intend to do with the data, it has to re-run the pipeline. For instance, if your stage trains a model on all of the given data D1+D2, you can’t just train it on D2 without D1 and combine the results meaningfully, at least not without application-level logic.