Can DVC support large remote DBs?

Hey all,
I tried DVC with a small dataset and I really liked it. The main thing DVC helped me with was controlling the data pipeline and versioning the data accordingly. To compare experiments we already use MLflow.

My team has another project with a much bigger dataset, which is stored in Postgres (on AWS). Can we use DVC to version our tables? For example:
Raw_Table → One_hot_conversion_table → Normalized_one_hot_conversion_table

Hi, @nrk. DVC cannot version database tables themselves. One way would be to transform the table into a file-based dataset and track that transformed dataset with DVC.
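
For instance, here is a minimal sketch of exporting a table to a file with pandas + SQLAlchemy (the connection string, table name, and output path are placeholders, not anything DVC itself provides):

```python
# Sketch: dump a Postgres table to a local file so DVC can track it.
# Connection string, table name, and paths below are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@host:5432/mydb")

# Stream the table in chunks so it never has to fit in memory at once.
with open("data/raw_table.csv", "w") as f:
    for i, chunk in enumerate(
        pd.read_sql_query("SELECT * FROM raw_table", engine, chunksize=100_000)
    ):
        chunk.to_csv(f, index=False, header=(i == 0))
```

Once the file exists, `dvc add data/raw_table.csv` puts it under DVC control like any other data file.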

Another way might be to track the processed datasets/outputs of the pipeline, which are small, rather than the large dataset itself. Please take a look at the ongoing discussion on function-specific dependencies.
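
As a sketch of that second idea, a hypothetical one-hot stage whose small output is what gets versioned (the column name and file paths are made up for illustration):

```python
# Hypothetical pipeline stage: derive the one-hot table from the raw export
# and let DVC version only this (much smaller) processed output.
import pandas as pd

raw = pd.read_csv("data/raw_table.csv")
one_hot = pd.get_dummies(raw, columns=["category"])  # "category" is a placeholder column
one_hot.to_csv("data/one_hot_conversion_table.csv", index=False)
```

Wrapped in a DVC stage, e.g. `dvc run -d onehot.py -d data/raw_table.csv -o data/one_hot_conversion_table.csv python onehot.py`, only the declared output lands in the DVC cache, and the pipeline records how it was produced.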


Transform as in converting columns into rows and vice versa? That won't solve the size issue.