PostgreSQL sort order affected by succeeding characters-CodePudding

I want to sort test last name and test2 last name so that the former comes before the latter. My understanding is that each character is compared from left to right until they differ and therefore the characters after those first differing characters do not matter anymore. However, as shown below, test comes before test2 but as soon as I append another character, the order changes. Why does this happen? What collation should I use to get the desired order? Note that converting them to bytea would yield the desired order.

test=# SELECT 'test last name' < 'test2 last name' COLLATE "en_US";
 ?column?
----------
 f
(1 row)

test=# SELECT 'test last' < 'test2 last' COLLATE "en_US";
 ?column?
----------
 f
(1 row)

test=# SELECT 'test ' < 'test2 ' COLLATE "en_US";
 ?column?
----------
 t
(1 row)

test=# SELECT 'test ' < 'test2' COLLATE "en_US";
 ?column?
----------
 t
(1 row)

test=# SELECT 'test' < 'test2' COLLATE "en_US";
 ?column?
----------
 t
(1 row)

test=# SELECT 'test  ' < 'test2' COLLATE "en_US";
 ?column?
----------
 t
(1 row)

test=# SELECT 'test  ' < 'test2 ' COLLATE "en_US";
 ?column?
----------
 t
(1 row)

test=# SELECT 'test  ' < 'test2  ' COLLATE "en_US";
 ?column?
----------
 t
(1 row)

test=# SELECT 'test  ' < 'test2 ' COLLATE "en_US";
 ?column?
----------
 t
(1 row)

test=# SELECT 'test  ' < 'test2 l' COLLATE "en_US";
 ?column?
----------
 t
(1 row)

test=# SELECT 'test ' < 'test2 l' COLLATE "en_US";
 ?column?
----------
 t
(1 row)

test=# SELECT 'test l' < 'test2 l' COLLATE "en_US";
 ?column?
----------
 f
(1 row)

test=# SELECT 'test l' < 'test2l' COLLATE "en_US";
 ?column?
----------
 f
(1 row)

test=# SELECT 'test ' < 'test2l' COLLATE "en_US";
 ?column?
----------
 t
(1 row)

test=# SELECT 'test last name'::bytea < 'test2 last name'::bytea;
 ?column? 
----------
 t
(1 row)

CodePudding user response：

white space is special character in ICU collation.

see demo: https://www.unicode.org/reports/tr10/#Variable_Weighting_Examples
also here: http://www.unicode.org/reports/tr35/tr35-collation.html#table-collation-settings
simple explanation: https://unicode-org.github.io/icu/userguide/collation/customization/ignorepunct.html#shift-trimmed

You can following test:

CREATE COLLATION coll_shifted(provider = icu, locale = 'en-u-ka-shifted');
CREATE COLLATION coll_noignore(provider = icu, locale = 'en-u-ka-noignore');

SELECT 'test last name' < 'test2 last name' COLLATE coll_shifted
union all
SELECT 'test last name' < 'test2 last name' COLLATE coll_noignore
union all
SELECT 'test last name' < 'test2 last name' COLLATE "en_US";

if you just want compare by code pointer, you can use COLLATE "C" or COLLATE "POSIX".

CodePudding user response：

That's how natural language collations work. If you want to compare character by character and have the space character be like other characters, use the C collation:

SELECT 'test last name' < 'test2 last name' COLLATE "C";

 ?column? 
══════════
 t
(1 row)

But don't complain if 'Z' < 'a' …