What is the best way to store varying-size columns in Postgres for language translation?

Let's say I create a table in Postgres to store language translations. Suppose I have a table EVENT with multiple columns that need translation into other languages. Rather than adding a new column to EVENT for each language, I would just add new rows to LANGUAGE with the same language_id.

EVENT:

id | EVENT_NAME (fk to LANGUAGE.language_id; text in short) | EVENT_DESCRIPTION (fk to LANGUAGE.language_id; text in long)
---+---------------------------------------------------------+--------------------------------------------------------------
 0 | 1                                                       | 2

LANGUAGE:

language_id | language_type | short (varchar(200)) | med (varchar(50)) | long (varchar(2000))
------------+---------------+----------------------+-------------------+-------------------------------
          1 | english       | game of soccer       | null              | null
          1 | spanish       | partido de fútbol    | null              | null
          2 | english       | null                 | null              | A really great game of soccer
          2 | spanish       | null                 | null              | Un gran partido de fútbol
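
In DDL form, the schema I'm describing would look roughly like this (a sketch; note that event_name and event_desc can't be true foreign keys, since language_id alone is not unique):

create table language (
  language_id   integer,
  language_type varchar(20),
  short         varchar(200),
  med           varchar(50),
  long          varchar(2000),
  primary key (language_id, language_type)
);

create table event (
  id         integer primary key,
  event_name integer,  -- points at LANGUAGE.language_id
  event_desc integer   -- points at LANGUAGE.language_id
);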

If I want the language-specific version, I would create a parameterized statement and pass in the language, like this:

select e.id, name.short, descr.long
from event e, language name, language descr
where e.id = 0
and e.event_name = name.language_id
and name.language_type = ?
and e.event_desc = descr.language_id
and descr.language_type = ?
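
(Or, equivalently, in explicit join syntax:)

select e.id, name.short, descr.long
from event e
join language name on name.language_id = e.event_name
  and name.language_type = ?
join language descr on descr.language_id = e.event_desc
  and descr.language_type = ?
where e.id = 0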

My first thought was to have just a single column for the translated text, but I would have to make it big enough to hold any translation. Setting it to 2000 when many records will only be 50 characters seemed wrong. Hence I thought I might add columns of different sizes and use whichever fits the data I'm storing (the event name can be restricted to 50 characters on the front end, and the description to 2000 characters).

In the LANGUAGE table, only one of the three columns (short, med, long) will be set per row. This is just my initial thought, but I'm trying to understand whether it is a bad approach. Does the disk still reserve 2,250 characters if I only set the short value? I read a while back that if you do this sort of thing in Oracle, it has to reserve the space for all columns in the disk block; otherwise, updating the record would have to grow it dynamically, which could be slow. Is Postgres the same?

It looks like you can specify a character varying type without a length. Would it be more space-efficient to define a single column with no size specified, or a single column with the size set to 2000?

CodePudding user response:

Just use a single column of data type text. It will perform just as well as character varying(n) in PostgreSQL, because the implementation is exactly the same, minus the length check. PostgreSQL only stores as many characters as the string actually has, so there is no overhead in using text even for short strings.

In the words of the documentation:

There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column.
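
For example, your LANGUAGE table could collapse to a single text column (a sketch; the column name translation is just a suggestion):

-- One text column replaces short/med/long; PostgreSQL only
-- stores the bytes the string actually uses.
create table language (
  language_id   integer,
  language_type text,
  translation   text not null,
  primary key (language_id, language_type)
);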

CodePudding user response:

Use text.

With Postgres, unless you really need to put a hard limit on the text size (and usually you do not), use text.

I have seen posts claiming that varchar without n and text perform the same, but can be slightly slower than varchar(n).

That is incorrect as of PostgreSQL 14: text is the most performant.

There is no performance difference among these three types [char, varchar, text], apart from increased storage space when using the blank-padded type [ie. char], and a few extra CPU cycles to check the length when storing into a length-constrained column [ie. varchar, char].

Storage for varchar and text is the same.

The storage requirement for a short string (up to 126 bytes) is 1 byte plus the actual string, which includes the space padding in the case of character. Longer strings have 4 bytes of overhead instead of 1. Long strings are compressed by the system automatically, so the physical requirement on disk might be less. Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values.
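
You can check this yourself with pg_column_size(), which reports the number of bytes a value occupies:

-- text and varchar(n) share the same representation,
-- so both of these come out the same size.
select pg_column_size('game of soccer'::text)          as text_size,
       pg_column_size('game of soccer'::varchar(2000)) as varchar_size;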


Do you need to reinvent this wheel?

Internationalization and localization tools already exist; they're built into many application frameworks.

Most localization systems work by using a string written in the developer's preferred language (for example, English) and then localizing that string as necessary. Then it becomes a simple hash lookup.

Rather than doing the localization in the database, query the message and let the application framework localize it. This is simpler, faster, and easier to work with, and it avoids putting additional load on the database.


If you really must do it in the database, here are some notes:

  • language is a confusing name for your table; consider messages, translations, or localizations.
  • Use standard IETF language tags such as en-US.
  • Put any fixed set of values, such as language tags, into a table for referential integrity.
    • You could use an enum (see the sketch after this list), but they're awkward to read and change.
  • Rather than having short, medium, and long columns, consider having one row for each message and a size column. This avoids a lot of null columns.
    • In fact, don't have "size" at all. It's arbitrary and subjective, as evidenced by your null columns.
  • Don't prefix columns with the name of the table; fully qualify them as needed.
  • Separate the message from its localization for referential integrity.
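
(For reference, the enum alternative would look like this hypothetical sketch; note that adding a tag later requires ALTER TYPE, which is part of why the table approach is preferable:)

-- Hypothetical enum alternative to a languages table.
create type language_tag as enum ('en-US', 'es-ES');

Putting the notes together into a schema: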
create table languages (
  id serial primary key,
  tag text unique not null
);

-- This table might seem a bit silly, but it allows tables to refer to
-- a single message and enforce referential integrity.
create table messages (
  id bigserial primary key
);

create table message_localizations (
  -- If a message or language is deleted, so will its localization.
  message_id bigint not null references messages on delete cascade,
  language_id int not null references languages on delete cascade,

  localization text not null,

  -- This serves as both a primary key and enforces one localization
  -- per message and language.
  primary key(message_id, language_id)
);

create table events (
  id bigserial primary key,
  name_id bigint not null references messages,
  description_id bigint not null references messages
);

Then join each message with its localization.

select
  events.id,
  ml1.localization as name,
  ml2.localization as description
from events
-- left join so there is a result even if there is no localization. YMMV.
left join languages lt on lt.tag = 'en-US'
left join message_localizations ml1
  on ml1.message_id = events.name_id and ml1.language_id = lt.id
left join message_localizations ml2
  on ml2.message_id = events.description_id and ml2.language_id = lt.id
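
To make it concrete, here is some hypothetical sample data (the literal ids assume a fresh database whose sequences start at 1):

insert into languages (tag) values ('en-US'), ('es-ES');

-- One message row each for the event's name and description.
insert into messages default values;  -- id 1
insert into messages default values;  -- id 2

insert into message_localizations (message_id, language_id, localization) values
  (1, 1, 'game of soccer'),
  (1, 2, 'partido de fútbol'),
  (2, 1, 'A really great game of soccer'),
  (2, 2, 'Un gran partido de fútbol');

insert into events (name_id, description_id) values (1, 2);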

But, again, you probably want to use an existing localization tool.
