Can DVC support large remote DBs?

Hey all,
I tried DVC with a small dataset and I really liked it. The main thing DVC helped me with was controlling the data pipeline and versioning the data accordingly. To compare experiments we already use MLflow.

My team has another project with a much bigger dataset, which is stored in Postgres (on AWS). Can we use DVC to version our tables? For example:
Raw_Table → One_hot_conversion_table → Normalized_one_hot_conversion_table

Hi, @nrk. DVC cannot version database tables themselves. One way would be to transform the table into a file-based dataset and track that transformed dataset with DVC.
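
For instance, here is a minimal sketch of exporting a table to a file with pandas + SQLAlchemy (the connection string, table name, and output path are placeholders, not anything DVC itself provides):

```python
# Sketch: dump a Postgres table to a local file so DVC can track it.
# Connection string, table name, and paths below are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@host:5432/mydb")

# Stream the table in chunks so it never has to fit in memory at once.
with open("data/raw_table.csv", "w") as f:
    for i, chunk in enumerate(
        pd.read_sql_query("SELECT * FROM raw_table", engine, chunksize=100_000)
    ):
        chunk.to_csv(f, index=False, header=(i == 0))
```

Once the file exists, `dvc add data/raw_table.csv` puts it under DVC control like any other data file.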

Another way might be to track the processed datasets/outputs of the pipeline, which are small, rather than the large dataset itself. Please take a look at the ongoing discussion on function-specific dependencies.
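
As a sketch of that second idea, a hypothetical one-hot stage whose small output is what gets versioned (the column name and file paths are made up for illustration):

```python
# Hypothetical pipeline stage: derive the one-hot table from the raw export
# and let DVC version only this (much smaller) processed output.
import pandas as pd

raw = pd.read_csv("data/raw_table.csv")
one_hot = pd.get_dummies(raw, columns=["category"])  # "category" is a placeholder column
one_hot.to_csv("data/one_hot_conversion_table.csv", index=False)
```

Wrapped in a DVC stage, e.g. `dvc run -d onehot.py -d data/raw_table.csv -o data/one_hot_conversion_table.csv python onehot.py`, only the declared output lands in the DVC cache, and the pipeline records how it was produced.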


Transform as in converting columns into rows and vice versa? That won't solve the size issue.