The workflow Supabase uses for version controlling the (PostgreSQL) schema is for the developer to manually run a tool that diffs the schema after every change and saves the resulting migration to a file.
I found it surprising to build the database schema using imperative commands (CREATE TABLE, ENABLE ROW LEVEL SECURITY, etc.) and then use a diffing tool to create imperative migrations.
It seems like a better approach would be to describe the schema using some declarative markup language, like JSON, YAML, XML, etc., and then migrations could just be the Git diffs between versions of the schema declaration.
I was also confused by the common assertion that SQL is declarative, when from my perspective (as a newbie) it appears to consist of commands that must be executed in order. I found a couple of projects that claim to be working on declarative database schemas instead of migrations (skeema.io, schemahero.io), but no sign of widespread adoption.
Why is this the case? Is it an artifact of history, or are there some key challenges with a declarative approach to schema management that I am missing?
I found this answer on Reddit:
90% of the effort in updating database schemas is updating the data. This cannot be done with desired state. For example, introduce a city reference table: you just can't do this with a schema diff, you need to make business logic decisions while you migrate your data over. So use one of the many existing migration systems and forget about this idea.
So is the answer that you need to issue imperative commands on a case-by-case basis to deal with migrating the data, and a declarative engine would be unable to do that?
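To make the Reddit example concrete, here is a minimal sketch of introducing a city reference table. It uses Python's standard-library sqlite3 module as a stand-in for PostgreSQL, and the table names, sample rows, and the CANONICAL mapping are all invented for illustration. The point is that the mapping from free-text spellings to canonical cities is a decision a schema diff has no way to infer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Old schema: the city is stored as free text on each customer (hypothetical data).
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
INSERT INTO customer (name, city) VALUES
  ('Ann', 'New York'), ('Bob', 'new york'), ('Cid', 'NYC'), ('Dee', 'Boston');
""")

# Business decision: which spellings refer to the same city.
# No diff between the old and new schema contains this information.
CANONICAL = {"new york": "New York", "nyc": "New York", "boston": "Boston"}

# Schema part of the migration: this is all a declarative diff could produce.
conn.executescript("""
CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
ALTER TABLE customer ADD COLUMN city_id INTEGER REFERENCES city(id);
""")

# Data part of the migration: imperative, and specific to this data set.
for cust_id, raw_city in conn.execute("SELECT id, city FROM customer").fetchall():
    name = CANONICAL[raw_city.strip().lower()]
    conn.execute("INSERT OR IGNORE INTO city (name) VALUES (?)", (name,))
    (city_id,) = conn.execute(
        "SELECT id FROM city WHERE name = ?", (name,)
    ).fetchone()
    conn.execute("UPDATE customer SET city_id = ? WHERE id = ?", (city_id, cust_id))
conn.commit()

city_count = conn.execute("SELECT COUNT(*) FROM city").fetchone()[0]
unmapped = conn.execute(
    "SELECT COUNT(*) FROM customer WHERE city_id IS NULL"
).fetchone()[0]
print(city_count, unmapped)
```

Dropping the old city column is left out of the sketch; it would be the final step once every row has been mapped.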
CodePudding user response:
SQL stands for Structured Query Language. The most important word in this name is QUERY, which means it is not a procedural language. In a procedural language, you write the exact commands that you want the computer to carry out. In SQL, a "query" language, you do not write program code, only the desired answer; the SQL algebrizer/optimizer then has to compute the program that will be executed by the query processor (known as the "query execution plan"). This holds for SELECT, INSERT, UPDATE... queries (the DML part of the language) as well as for CREATE, ALTER, DROP (the DDL part) and GRANT / REVOKE (the DCL part).
To optimize any query, the optimizer needs a complete view of the structure, and this structure must be stable. If any part of the structure is missing (for example, a table that does not exist yet when a CREATE VIEW references it), the SQL command will fail. Another point is that the "compilation" of the SQL code happens at the moment the query is executed, and in some RDBMSs the execution plan is cached in memory. If the structure is not stable, cached execution plans are discarded whenever any part of a SQL object is altered or dropped, an index is created, or the data changes significantly (the optimizer's statistics are recomputed).
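The dependence of the plan on the current structure can be observed directly. The following is a minimal sketch using Python's standard-library sqlite3 module and SQLite's EXPLAIN QUERY PLAN, as a stand-in for whichever RDBMS you use; the users table and idx_users_email index are hypothetical names. The query text never changes, yet the engine picks a different plan once an index exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

def plan(sql):
    """Return SQLite's chosen plan for `sql` as one readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # Each row is (id, parent, notused, detail); the detail column is the text.
    return " | ".join(row[3] for row in rows)

query = "SELECT id FROM users WHERE email = 'a@example.com'"

plan_before = plan(query)  # no index yet: the engine has to scan the table
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = plan(query)   # same query text, but now an index search is chosen

print(plan_before)
print(plan_after)
```

The exact wording of the plan text varies between SQLite versions, so the sketch prints it rather than assuming a particular string.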
Also, views, which are a special kind of table, are based on tables but are logically independent... That means a view can become desynchronized from the structure of the tables it uses...
For all these reasons, you cannot compare an execution language to a query language...
CodePudding user response:
The actual question seems to be
"Why do we keep db schema definition in `.sql` and not in `.json`, `.yml` or `.xml`?"
and it's probably assuming that holding the entire definition as one object is declarative and holding it as a series of commands defining things bit by bit is imperative. To unwrap this:
- What vs. how is a better simplification of declarative vs. imperative. SQL is declarative. You define what your schema is and what you want from it. The RDBMS handles how it's carried out. You can prepend `EXPLAIN ANALYZE VERBOSE` to your query and watch the planner come up with an optimal way of achieving that. You can also check out the file system representation of your definitions, also figured out by the db, not you.
- Holding your schema definition in `.json`, even in a single object, doesn't make it a one-off declaration. After all, it'll consist of sub-objects that actually declare the individual relations, and at some point it's parsed and translated to a series of corresponding SQL "commands".
- Neither a plain `.sql` nor a `.json` has any advantage over the other when it comes to diffing in git.
- The reason it's usually held as `.json` is to abstract from a specific SQL dialect and put a translation layer between that and the actual underlying db. It's also to accommodate custom mapping between objects in the app and relations in the db.
- An ORM tool can diff your `.json`-based schema and run a single `create`/`alter`/`drop` based on that.
- Both schemahero.io and skeema.io seem to be just tools to formulate an `alter` based on the difference between two `create` statements, to get from one to the other without re-creating it from scratch, which is in essence generating a raw SQL equivalent of a migration based on a raw SQL schema.
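The idea of deriving an `alter` from two schema definitions can be sketched in a few lines of Python. This is only a toy illustration under heavy assumptions, not how schemahero.io or skeema.io are implemented: the function diff_table is hypothetical, and each table is reduced to a dict of column name to type.

```python
def diff_table(table, current, desired):
    """Return ALTER statements that turn the `current` columns into `desired`.

    Both arguments map column name -> column type. Only added and removed
    columns are handled; type changes and renames are ignored in this sketch.
    """
    statements = []
    for column, column_type in desired.items():
        if column not in current:
            statements.append(
                f"ALTER TABLE {table} ADD COLUMN {column} {column_type};"
            )
    for column in current:
        if column not in desired:
            statements.append(f"ALTER TABLE {table} DROP COLUMN {column};")
    return statements


# Hypothetical "before" and "after" definitions of the same table.
current = {"id": "integer", "name": "text", "city": "text"}
desired = {"id": "integer", "name": "text", "city_id": "integer"}

statements = diff_table("customer", current, desired)
for statement in statements:
    print(statement)
```

Note what the diff cannot know: whether `city_id` is a brand-new column or a transformed version of `city`. It can only emit an add and a drop, which would discard the existing values, so the data still has to be moved by hand-written steps.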