How does DVC do version checking?

Hi,

Let’s assume that we have data in an S3 bucket. The data is downloaded locally and, after processing, a TFRecord file of the data is produced.

The number of samples in the S3 bucket has grown, and we don’t want to redo the pipeline for all of the data. The new data is in the same S3 bucket as the previous version. No versioning of the data is done on S3.

  1. How does DVC understand that the data has changed? Do we need to trigger it manually?
  2. How can the pipeline be only run for the new version of the data?
  3. How can we tag the data and/or the TFRecord with a certain version number that is implicit in the system? Can this tag be read from anywhere?

I would appreciate it if you could share your solutions.

Regards,
Siavash

Hi @ssakhavi !

DVC computes checksums and, if they don’t match what is recorded in the .dvc files, it concludes that the data has changed.

DVC won’t re-run your pipeline if the data didn’t change; this is a basic feature of DVC.
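Conceptually, the per-dependency check looks something like this (a simplified sketch of the idea, not DVC’s actual implementation):

```python
import hashlib

def file_md5(path: str) -> str:
    """Compute the MD5 checksum of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def dependency_changed(path: str, recorded_md5: str) -> bool:
    """A stage is re-run only if the current checksum of a dependency
    differs from the checksum recorded when the stage last ran."""
    return file_md5(path) != recorded_md5
```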

Currently you can either use git tags for that, or use a naming strategy for your data so that the version is reflected in the filename. We are working on an additional dvc tag feature that will introduce DVC tags, which you will be able to set for particular data files in the respective dvcfiles.
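For example, a filename-based strategy could look like this (the paths and version string here are just hypothetical):

```python
# Hypothetical convention: encode the dataset version directly in the
# file paths, so the version is visible in .dvc files and in the S3 bucket.
VERSION = "v2"
RAW_DATA_PATH = f"data/raw_{VERSION}"
TFRECORD_PATH = f"data/processed/dataset_{VERSION}.tfrecord"
```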

Thanks,
Ruslan


Dear Ruslan,

Thanks for the reply.

Regarding my second question: in the case that data has been added to the original data, the whole pipeline will be re-executed to produce an output.

What if we don’t want to execute the pipeline on the whole dataset but only on the newly added part? Is there any way we can make DVC aware of this?

If I understand correctly, what you mean is a per-sample transformation of the data?
E.g. some function f(x) = y that needs to be applied to each piece of data separately, so that given some new data point x_{n+1} you only need to compute and save y_{n+1} and can keep all the previous y’s?

I would suggest using the new
dvc run -d INPUT_DATA_FOLDER --outs-persist OUTPUT_FOLDER
option.

Then, when your repro runs, your script can check the output folder (which won’t be removed by DVC prior to the repro, due to --outs-persist) to see which data points already have computed y’s, and compute y’s only for the new data.
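A minimal sketch of that application-level logic (the folder names match the command above; process_sample is a hypothetical stand-in for your real transformation):

```python
from pathlib import Path

INPUT_DIR = Path("INPUT_DATA_FOLDER")   # the stage's -d dependency
OUTPUT_DIR = Path("OUTPUT_FOLDER")      # the stage's --outs-persist output

def process_sample(src: Path, dst: Path) -> None:
    """Placeholder for the real transformation f(x) = y,
    e.g. converting one sample into a TFRecord shard."""
    dst.write_bytes(src.read_bytes())  # stand-in for real processing

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
for src in sorted(INPUT_DIR.iterdir()):
    dst = OUTPUT_DIR / (src.name + ".processed")
    if dst.exists():
        # Output persisted from a previous run: skip recomputation.
        continue
    process_sample(src, dst)
```

DVC still decides *when* the stage runs, based on the dependency checksums; the script itself decides *how much* work is actually redone.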


Maybe I was not clear in my first question.

Let’s assume we have our first batch of raw data, which we denote D1. New data arrives, which we denote D2. That means the new version of our raw data is D1+D2.

Assuming that we have run a pipeline on D1, we don’t want the pipeline to be applied to D1+D2 but rather to D2 only. This can be because of computational limitations.

How can this be implemented in DVC?

My suggestion above still applies; as of now this needs to be application-level logic rather than something DVC does for you.
Meaning, the DVC pipeline WILL be re-run, but you can avoid the unnecessary computation by checking which outputs have already been computed.

Since DVC doesn’t know what you intend to do with the data, it has to rerun the pipeline. For instance, if your stage is training a model on all the given data D1+D2, you can’t just train it on D2 without D1 and combine the results meaningfully. At least not without application-level logic.


Thanks for the answer.
I understand your suggestions now.
