how to manage common information between multiple tables in databases-CodePudding

this is my first question on stack-overflow, i am a full-stack developer i work with the following stack: Java - spring - angular - MySQL. i am working on a side project and i have a database design questions.

i have some information that are common between multiple tables like:

Document information (can be used initially in FOLDER and CONTRACT tables).
Type information(tables: COURT, FOLDER, OPPONENT, ...).
Status (tables: CONTRACT, FOLDER, ...).
Address (tables: OFFICE, CLIENT, OPPONENT, COURT, ...).

To avoid repetition and coupling the core tables with "Technical" tables (information that can be used in many tables). i am thinking about merging the "Technical" tables into one functional table. for example we can have a generic DOCUMENT table with the following columns:

ID
TITLE
DESCRIPTION
CREATION_DATE
TYPE_DOCUMENT (FOLDER, CONTRACT, ...)
OBJECT_ID (Primary key of the TYPE_DOCUMENT Table)
OFFICE_ID
PATT_DATA

for example we can retrieve the information about a document with the following query:

SELECT * FROM DOCUMENT WHERE OFFICE_ID = "office 1 ID" AND TYPE_DOCUMENT = "CONTRACT" AND OBJECT_ID= "contract ID";

we can also use the following index to optimize the query: CREATE INDEX idx_document_retrieve ON DOCUMENT (OFFICE_ID, TYPE_DOCUMENT, OBJECT_ID);

My questions are:

is this a good design.
is there a better way of implementing this design.
should i just use normal database design, for example a Folder can have many documents, so i create a folder_document table with the folder_id as a foreign key. and do the same for all the tables.

Any suggestions or notes are very welcomed and thank you in advance for the help.

CodePudding user response：

What you're describing sounds like you're trying to decide whether to denormalize and how much to denormalize.

The answer is: it depends on your queries. Denormalization makes it more convenient or more performant to do certain queries against your data, at the expense of making it harder or more inefficient to do other queries. It also makes it hard to keep the redundant data in sync.

So you would like to minimize the denormalization and do it only when it gives you good advantages in queries you need to be optimal.

Normalizing optimizes for data relationships. This makes a database organization that is not optimized for any specific query, but is equally well suited to all your queries, and it also has the advantage of preventing data anomalies.

Denormalization optimizes for specific queries, but at the expense of other queries. It's up to you to know which of your queries you need to prioritize, and which of your queries can suffer.

There's no way anyone on Stack Overflow can know your queries better than you do.

CodePudding user response：

Case 1: status

"Status" is usually a single value. To make it readable, you might use ENUM. If you need further info about a status, there be a separate table with PRIMARY KEY(status) with other columns about the statuses.

Case 2: address

"Address" is bulky and possibly multiple columns. (However, since the components of an "address" is rarely needed by in WHERE or ORDER BY clauses, there is rarely a good reason to have it in any form other than TEXT and with embedded newlines.

However, "addressis usually implemented as several separate fields. In this case, a separate table is a good idea. It would have a columnid MEDIUMINT UNSIGNED AUTO_INCREMENT PRIMARY KEYand the various columns. Then, the other tables would simply refer to it with anaddress_idcolumn andJOIN` to that table when needed. This is clean and works well even if many tables have addresses.

One Caveat: When you need to change the address of some entity, be careful if you have de-dupped the addresses. It is probably better to always add a new address and waste the space for any no-longer-needed address.

Discussion

Those two cases (status and access) are perhaps the extremes. For each potentially common column, decide which makes more sense. As Bill points, out, you really need to be thinking about the queries in order to get the schema 'right'. You must write the main queries before deciding on indexes other than the PRIMARY KEY. (So, I won't now address your question about an Index.)

Do not use a 4-byte INT for something that is small, mostly immutable, and easier to read:

2-byte country_code (US, UK, JP, ...)
5-byte zip-code CHAR(5) CHARSET ascii; similar for 6-byte postal_code
1-byte `ENUM('maybe', 'no', 'yes')
1-byte `ENUM('not_specified', 'Male', 'Female', 'other'); this might not be good if you try to enumerate all the "others".
1-byte ENUM('folder', ...)

Your "folder" vs "document" is an example of a one-to-many relationship. Yes, it is implemented by having doc_id in the table Folders.

"many-to-many" requires an extra table for connecting the two tables.

ENUM

Some will argue against ever using ENUM. In your situation, there is no way to ensure that each table uses the same definition of, for example, doc_type. It is easy to add a new option on the end of the list, but costly to otherwise rearrange an ENUM.

id (or ID) is almost universally reserved (by convention) to mean the PRIMARY KEY of a table, and it is usually (but not necessarily) AUTO_INCREMENT. Please don't violate this convention. Notice in my example above, id was the PK of the Addresses table, but called address_id in the referring table. You can optionally make a FOREIGN KEY between the two tables.