I have an application that needs to store arbitrary JSON data with a schema. I have the schema validation/serialization/etc. in place, but I'm a bit puzzled about how to store it in PostgreSQL. My main concern is scalability: what happens if my database grows large (say, over 100 GB)? My current schema looks like this:
CREATE TABLE "Schema" (
"namespace" CHAR(50) NOT NULL,
"name" CHAR(50) NOT NULL,
"version" CHAR(50) NOT NULL,
"schemaObject" JSONB NOT NULL,
"createdAt" TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP,
"updatedAt" TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP,
"deletedAt" TIMESTAMP(6),
CONSTRAINT "Schema_pkey" PRIMARY KEY ("namespace","name","version")
);
CREATE TABLE "Data" (
"id" TEXT NOT NULL,
"data" JSONB NOT NULL,
"createdAt" TIMESTAMP(6) NOT NULL,
"updatedAt" TIMESTAMP(6) NOT NULL,
"deletedAt" TIMESTAMP(6),
"schemaNamespace" CHAR(50) NOT NULL,
"schemaName" CHAR(50) NOT NULL,
"schemaVersion" CHAR(50) NOT NULL,
CONSTRAINT "Data_pkey" PRIMARY KEY ("id")
);
ALTER TABLE "Data" ADD CONSTRAINT "Data_schemaNamespace_schemaName_schemaVersion_fkey" FOREIGN KEY ("schemaNamespace", "schemaName", "schemaVersion") REFERENCES "Schema"("namespace", "name", "version") ON DELETE RESTRICT ON UPDATE CASCADE;
So each JSON Schema is stored in the Schema table, and it is uniquely identified by a namespace + name + version combo. Then I have a Data table where I can store individual records. How can I improve this to be scalable? I have concerns with the "store everything in 1 table" idea. Am I doing it wrong, or is this the right way?
A little more information on the application that will use this: it is a data aggregation service that will supply a federated query interface (GraphQL) for outside clients. Every Data object will have an id in it, and I'll query based on that id, but apart from that I'll only query lists of data for a Schema. This also has to be a general-purpose solution; I don't expect specific query patterns. I'm also going to use a cursor, probably based on the highly granular createdAt field (I'm not expecting writes more frequent than a precision of 6 can distinguish).
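For reference, the keyset-style cursor query I have in mind would look roughly like this (the parameter values and the page size are just placeholders):

-- Sketch of the cursor query; $1..$5 and the LIMIT are placeholders
SELECT "id", "data", "createdAt"
FROM "Data"
WHERE "schemaNamespace" = $1
  AND "schemaName" = $2
  AND "schemaVersion" = $3
  AND ("createdAt", "id") > ($4, $5)  -- cursor position from the previous page
ORDER BY "createdAt", "id"
LIMIT 50;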
CodePudding user response:
Your question is very broad. Such a table can scale to hundreds of GB, but you need to ask yourself what kinds of operations you are going to run on that data in order to figure out what else you will need (most probably indexes on some JSON fields, etc.).
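For example, if your queries end up filtering on a field inside data, an expression index or a GIN index could cover that. The field name status below is purely illustrative:

-- Illustrative only: "status" is a hypothetical JSON field name
CREATE INDEX data_status_idx ON "Data" ((data ->> 'status'));        -- B-tree index on one extracted field
CREATE INDEX data_gin_idx ON "Data" USING GIN (data jsonb_path_ops); -- for containment queries with @>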
The true power in using JSON in an SQL-based RDBMS lies in the opportunity to get the best of both worlds. That means it is often best to use structured data, foreign key constraints, etc. for at least some parts of the data and leave the JSON type only for those parts where you need more flexibility.
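As a rough sketch of that hybrid approach (the table and column names are made up), the attributes you query and constrain live in regular columns, and only the genuinely free-form remainder stays in JSONB:

-- Rough sketch of the hybrid approach; table and column names are made up
CREATE TABLE "Order" (
    "id"         BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    "customerId" BIGINT NOT NULL REFERENCES "Customer" ("id"),  -- structured, enforced by a foreign key
    "status"     TEXT   NOT NULL,                                -- structured, easy to index and constrain
    "extras"     JSONB                                           -- only the free-form attributes
);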
CodePudding user response:
You give too little information for a definitive answer. For example, what is going to be stored in those JSON columns? How big will the data be? How will the data be used in queries?
But I suspect that your design is not good. For example, you say that each JSON object will have an attribute named id that you will want to query for. It would be much better to store that attribute as a regular table column!
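One way to do that, assuming the attribute has to remain inside the JSON as well, is a stored generated column; the column name externalId is just an example:

-- Sketch only; "externalId" is an example name for the promoted attribute
ALTER TABLE "Data"
    ADD COLUMN "externalId" TEXT GENERATED ALWAYS AS (data ->> 'id') STORED;

CREATE UNIQUE INDEX data_external_id_idx ON "Data" ("externalId");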
Try coming up with a data model that avoids JSON altogether. Only resort to JSON for those aspects of the data that cannot be modeled otherwise. Maybe you can get some inspiration from my article.