Can I use dvc over a postgres database folder?

jnaranjo · December 21, 2023, 8:42am

Hello!
I am working with a postgres database. The database files are located in a folder which is a shared volume between my computer and a postgres container. My question is if it would be correct to version that folder with dvc.
If the folder is called database it would be to do

dvc add database

and from there make new versions as I add entries.

Is this a correct use of DVC or should I do something else?
I have only seen examples where data is versioned: images, etc.

Thank you very much

skshetry · December 21, 2023, 10:53am

Hi, what is inside database folder? If you are talking about files, then you can dvc add database, although I’d not suggest using a shared volume with postgres.

If your goal is to track a database table, we are working on implementing dvc import-db command that can snapshot your database table (or, any SQL queries) to a csv/json format.

For that, you’ll need a very latest dvc version, and set connection string in the config.

dvc config db.pgsql.url postgresql://<hostname>:<port>/<db_name>
dvc config --local db.pgsql.user <user>
dvc config --local db.pgsql.password <password>

Then, you can materialize in dvc with following, which will create a results.csv file.

dvc import-db --sql "select * from Table" --conn pgsql

You can then depend on it in your pipelines. Note that this is very experimental, and can change in the future. Please give feedback or create an issue for bugs/feature-requests.

jnaranjo · December 21, 2023, 11:29am

thank you for the quick response. Inside the folder are the files generated by the docker container (pg_serial, pg_subtrans, etc.). The modification of the database is done with sqlalchemy. My idea is to track the whole database as a whole. I am in a project where records are added to all tables periodically, for example, every 3 months. I would like to generate a version every time records are added and have a reference point of the content of those tables at that instant of time.

Why do you say it is not optimal to do the dvc add if it is a shared volume of desserts?
Does the suggestion you gave me at the end make sense in my scenario?

Best regards

skshetry · December 21, 2023, 12:28pm

Hi. You can version anything with DVC, nothing restricts you to do that. But, DVC’s goal is to version your datasets, and connect that datasets to your outputs (e.g.: models) via pipelines.

Tracking a random blob does not give you a visibility into your pipelines. Whereas, dvc allows you to periodically snapshot your database, and use that in your training. (you can run dvc update when you want to update to a newer version of your database).

You cannot consume that database directory in your pipelines, either. Could you please elaborate what you are trying to do with DVC?

(If you are looking for purely database versioning. I am sure there are better ways to backup and version them.)

jnaranjo · December 21, 2023, 2:20pm

What I want to do is just versioning. I don’t mind that I can’t add it to pipelines. I just want to have reference points in case I want to go back at some point or want to test something with the database at some specific point (version).

Topic		Replies	Views
Can dvc support remote large DBs? Questions	2	1264	June 1, 2021
Raw data from google big query Questions	4	1756	June 2, 2020
Working on remote data Questions	1	143	April 29, 2024
DVC and SNOWFLAKE setup Questions	2	67	February 5, 2025
DVC and Hydra integration Questions	5	2001	September 22, 2022

Can I use dvc over a postgres database folder?

Related topics