I am working with a Postgres database. The database files are located in a folder that is a shared volume between my computer and a Postgres container. My question is whether it would be correct to version that folder with DVC.
If the folder is called database, it would be a matter of running
dvc add database
and then creating new versions as I add entries.
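Concretely, I mean something like this (a sketch of the workflow I have in mind; the tag name is just an example):

```shell
dvc add database            # hash the folder and write database.dvc
git add database.dvc .gitignore
git commit -m "Snapshot database after this quarter's load"
git tag snapshot-2024-q2    # optional: a named reference point
dvc push                    # upload the data to the DVC remote
```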
Is this a correct use of DVC or should I do something else?
I have only seen examples where data is versioned: images, etc.
Thank you very much
Hi, what is inside the database folder? If you are talking about plain files, then you can
dvc add database, although I'd not suggest using a shared volume with Postgres.
If your goal is to track a database table, we are working on implementing a
dvc import-db command that can snapshot your database table (or any SQL query) to a CSV/JSON format.
For that, you'll need the very latest DVC version, and you need to set the connection string in the config:
dvc config db.pgsql.url postgresql://<hostname>:<port>/<db_name>
dvc config --local db.pgsql.user <user>
dvc config --local db.pgsql.password <password>
Then, you can materialize the snapshot in DVC with the following:
dvc import-db --sql "select * from Table" --conn pgsql
You can then depend on it in your pipelines. Note that this is very experimental, and can change in the future. Please give feedback or create an issue for bugs/feature-requests.
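For example, a stage in dvc.yaml could then consume the snapshot. (A sketch: the file names, stage name, and script here are assumptions for illustration, not something import-db produces for you.)

```yaml
stages:
  train:
    cmd: python train.py snapshot.csv   # snapshot.csv: the imported table
    deps:
      - snapshot.csv
    outs:
      - model.pkl
```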
Thank you for the quick response. Inside the folder are the files generated by the Docker container (pg_serial, pg_subtrans, etc.). Modifications to the database are made with SQLAlchemy. My idea is to track the database as a whole. I am on a project where records are added to all the tables periodically, for example every 3 months. I would like to generate a version every time records are added, so I have a reference point for the content of those tables at that instant in time.
Why do you say it is not optimal to do the dvc add if it is a shared volume with Postgres?
Does the suggestion you gave me at the end make sense in my scenario?
Hi. You can version anything with DVC; nothing restricts you from doing that. But DVC's goal is to version your datasets and connect those datasets to your outputs (e.g. models) via pipelines.
Tracking a random blob does not give you visibility into your pipelines, whereas dvc import-db allows you to periodically snapshot your database and use that snapshot in your training (you can run
dvc update when you want to move to a newer version of your database).
You cannot consume that
database directory in your pipelines, either. Could you elaborate on what you are trying to do with DVC?
(If you are looking purely for database versioning, I am sure there are better ways to back up and version a database.)
What I want to do is just versioning. I don't mind that I can't add it to pipelines. I just want to have reference points in case I want to go back at some point, or want to test something against the database at some specific point (version).
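In the meantime I might script the snapshots myself with SQLAlchemy: dump each table to CSV and dvc add the resulting folder. A toy sketch of the dumping part, using the stdlib sqlite3 module in place of my real Postgres connection (the table and column names are made up):

```python
import csv
import io
import sqlite3

# Stand-in for the real database: in practice this would be a Postgres
# connection (e.g. via SQLAlchemy); sqlite3 keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO records (value) VALUES (?)", [("a",), ("b",)])

def snapshot_table(conn, table):
    """Dump one table to CSV text: a header row, then all data rows."""
    cur = conn.execute(f"SELECT * FROM {table}")
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cur.description])  # column names
    writer.writerows(cur.fetchall())
    return out.getvalue()

csv_text = snapshot_table(conn, "records")
print(csv_text)
```

Writing one such CSV per table into a snapshots/ folder and versioning that folder with DVC would give me the per-quarter reference points I'm after, without tracking the raw Postgres data directory.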