I am trying to understand the differences between the new CockroachDB and other distributed SQL databases as compared to a cloud-managed database like Azure SQL Database.
It seems there is no difference in the use cases between them:
- Like various NoSQL databases, SQL (in general) allows partitioning keys.
- I can add cores in Azure to increase the performance as needed, I can also switch to Hyper-scale if I have an elastic workload.
- I can have read replication across multiple nodes over multiple availability zones (geo-locations)
- I can configure data replication in Azure SQL Database too.
It seems to me that a cloud SQL database covers all the use cases the newer distributed databases cover, so why would I want to use a newer product?
Isn't Azure SQL Database basically a distributed database server?
Am I missing something?
Answer:
Is Azure SQL Server a Distributed SQL database?
No.
Like various NoSQL databases, SQL (in general) allows partitioning keys.
Partitioning in NoSQL databases like Cassandra (and Azure Table Storage) is about distributing partitions to physically distinct nodes, and requires rows to have an explicitly set partition-key value.
- Cassandra nodes are physically different machines that can run independently, which gives it excellent resiliency.
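As a rough illustration of the point above, here is a minimal Python sketch of how a partition-key value can decide which physically distinct node stores a row. The node names and the MD5-plus-modulo scheme are assumptions made for brevity; Cassandra itself uses a Murmur3-based partitioner with token ranges, not this exact approach.

```python
import hashlib

# Hypothetical node names, for illustration only.
NODES = ["node-a", "node-b", "node-c"]


def node_for(partition_key: str, nodes=NODES) -> str:
    """Map a partition-key value to one node using a stable hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")  # first 8 bytes as an integer token
    return nodes[token % len(nodes)]


def place_rows(rows):
    """Group (partition_key, value) rows by the node that would own them."""
    placement = {node: [] for node in NODES}
    for key, value in rows:
        placement[node_for(key)].append((key, value))
    return placement


if __name__ == "__main__":
    rows = [("customer-1", "a"), ("customer-2", "b"), ("customer-1", "c")]
    for node, owned in place_rows(rows).items():
        print(node, owned)
```

Rows that share a partition-key value always land on the same node, while different keys spread across machines that can fail independently.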
Partitioning in SQL Server, Azure SQL, and Azure SQL Managed Instance is about dividing data up into row-groups that exist in the same server for performance, not resiliency.
- On on-prem MS SQL Server, these row-groups (well, partitions) can exist in different FILEGROUPs, which means they can exist in different storage volumes to avoid IO bottlenecks, but Azure SQL does not support multiple FILEGROUPs.
- The benefits of implementing partitioning, including on Azure SQL, are documented online; the article explains that it is about performance, not resilience.
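By contrast, SQL Server-style partitioning can be pictured as a lookup from a column value to a partition number inside one server. The Python sketch below assumes hypothetical year boundaries and mimics the behaviour of a RANGE RIGHT partition function, where each boundary value belongs to the partition on its right; it is an illustration, not how the engine is implemented.

```python
from bisect import bisect_right

# Hypothetical boundary values, for illustration only.
BOUNDARIES = [2021, 2022, 2023]


def partition_number(value, boundaries=BOUNDARIES) -> int:
    """Return the 1-based partition number for a value.

    Each boundary value falls into the partition on its right,
    so 2021 goes to partition 2 and anything below 2021 to partition 1.
    """
    return bisect_right(boundaries, value) + 1


if __name__ == "__main__":
    # Every partition lives in the same process here, just as every
    # partition lives on the same server in the scenario described above.
    partitions = {n: [] for n in range(1, len(BOUNDARIES) + 2)}
    for year in (2019, 2021, 2022, 2025):
        partitions[partition_number(year)].append(year)
    print(partitions)
```

Splitting rows this way can help queries skip irrelevant partitions, but losing the one server still loses every partition.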
I can add cores in Azure to increase the performance as needed, I can also switch to Hyper-scale if I have an elastic workload.
This fact has absolutely nothing to do with distributed databases.
I can have read replication across multiple nodes over multiple availability zones (geo-locations).
I can configure data replication in Azure SQL Database too.
- Replication isn't the same thing as a true distributed database:
- In Cassandra and other distributed databases, all clients can connect to all nodes and accomplish the same tasks; and you can arbitrarily add and remove nodes while the system is running.
- In SQL Server and Azure SQL's replication feature, the replica is strictly a "secondary" that is subordinate to your primary server.
- Clients can connect to either the secondary or the primary, but the secondary server can only perform read-only queries, whereas if a client wants to do DML (INSERT/UPDATE/DELETE/MERGE) or DDL (CREATE/ALTER) then the client must connect to the primary server.
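The primary/secondary split described above means the client, or something in front of it, has to decide where each statement goes. A minimal Python sketch of that routing decision follows; the endpoint names are hypothetical, and classifying a statement by its first keyword is a deliberate simplification (a query that starts with a common table expression, for example, would be sent to the primary).

```python
# Hypothetical endpoint names; a real deployment would use its own connection strings.
PRIMARY = "primary"
SECONDARY = "read-only-secondary"


def endpoint_for(statement: str) -> str:
    """Pick an endpoint for a SQL statement.

    Only statements that start with SELECT are sent to the read-only
    secondary; everything else (DML, DDL, anything unrecognised) is
    conservatively sent to the primary, the only server that accepts writes.
    """
    words = statement.split()
    keyword = words[0].upper() if words else ""
    return SECONDARY if keyword == "SELECT" else PRIMARY


if __name__ == "__main__":
    for sql in ("SELECT * FROM orders",
                "INSERT INTO orders VALUES (1)",
                "ALTER TABLE orders ADD note NVARCHAR(100)"):
        print(endpoint_for(sql), "<-", sql)
```

In a distributed database where every node accepts reads and writes, this routing step is unnecessary: any statement can go to any node.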
It seems to me that a cloud SQL database covers all the use cases the newer distributed databases cover, so why would I want to use a newer product?
It can't. Because Azure SQL is not a distributed database, it cannot allow any client to read and write to any node or endpoint and have that change replicated to all other nodes (using an eventual consistency model in systems such as Cassandra, or consensus-based replication in distributed SQL databases such as CockroachDB). Instead, Azure SQL requires writes to be performed by the single primary "server".
Note that an Azure SQL "server" (or logical server) is largely an abstraction that hides what Azure SQL really is: a distinct build of SQL Server's engine that runs in a high-availability Azure Service Fabric environment in a single Azure datacenter. That environment is how cores and RAM can be added and removed while the database is running, and it provides some local resilience against hardware failure.