I am looking for help in understanding the issue with retrieving certain characters in MYSQL. I have a table which contains international numerals in unicode for each numeral system (e.g. Roman, Hebrew, Thai, Khmer, Lao, etc). The character set of the table is utf8 and collation is utf8_general_ci.
The table is organised as follows:
NUM_SYS_NAME | NUM_ID | TEXT |
---|---|---|
Roman | 1 | I |
Roman | 2 | II |
Roman | 3 | III |
Roman | 5 | V |
Thai | 5 | ๕ |
Ethiopic | 500 | ፭፻ |
etc, there are 18 numbers for each numeral system, from 0 to 10, also 50, 100, 500, 1000, 10000.
The MySql query is this:
SELECT NUM_SYS_NAME,
max(case
when NUM_ID = 0 then
TEXT
else
'N/A'
end) as '0',
max(case
when NUM_ID = 1 then
TEXT
else
'N/A'
end) as '1',
max(case
when NUM_ID = 2 then
TEXT
else
'N/A'
end) as '2',
max(case
when NUM_ID = 3 then
TEXT
else
'N/A'
end) as '3',
max(case
when NUM_ID = 4 then
TEXT
else
'N/A'
end) as '4',
max(case
when NUM_ID = 5 then
TEXT
else
'N/A'
end) as '5',
max(case
when NUM_ID = 6 then
TEXT
else
'N/A'
end) as '6',
max(case
when NUM_ID = 7 then
TEXT
else
'N/A'
end) as '7',
max(case
when NUM_ID = 8 then
TEXT
else
'N/A'
end) as '8',
max(case
when NUM_ID = 9 then
TEXT
else
'N/A'
end) as '9',
max(case
when NUM_ID = 10 then
TEXT
else
'N/A'
end) as '10',
max(case
when NUM_ID = 50 then
TEXT
else
'N/A'
end) as '50',
max(case
when NUM_ID = 100 then
TEXT
else
'N/A'
end) as '100',
max(case
when NUM_ID = 500 then
TEXT
else
'N/A'
end) as '500',
max(case
when NUM_ID = 1000 then
TEXT
else
'N/A'
end) as '1000',
max(case
when NUM_ID = 10000 then
TEXT
else
'N/A'
end) as '10000'
from numerals
WHERE NUM_ID IN
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50, 100, 500, 1000, 10000)
GROUP BY NUM_SYS_NAME
The results are properly retrieved and formatted in a crosstab as expected, except for Roman numerals from 1 to 4 and also 9, 50, 100, 1000, 10000. So, for example, Roman numeral 5 will be displayed as V and 3 as N/A. If I set the value of the Roman numeral 3 to ZZZ instead of III is it then displayed properly in query results.
I have experimented with various collations and also setting the values in TEXT field to other characters for Roman numeral system and have found that any character from O to Z is displayed properly, however, anything else results in an N/A value, hence all numerals starting with I, L, D, C in Roman system are not retrieved. I am cracking my head now for 2 days to find the root cause of the issue and believe this has to do something a collation and mixing multiple languages, but not very sure, moreover, it is bizarre that the issue is with basic latin characters rather than the exotic ones.
Would appreciate your advice and to be pointed to the right direction. So far, could not even find a similar issue via search or help docs.
Thanks in advance! Alex
P.S. Table definition: numerals Table definition
P.S. 2 Fiddle: https://www.db-fiddle.com/f/op6eAg6s6mDWJiFRYXyYFS/3
CodePudding user response:
The encoding isn't the issue, it's about using MAX()
max(case when NUM_ID = 4 then TEXT else 'N/A' end) as '4',
<-- the string 'N/A'
is greater than the string 'IV'
, in alphabetic order, and then, the 'N/A'
is selected over the 'IV'
, or anything that has sorting order lesser than 'N/A'
. That's why the letter 'O'
works, because it's after 'N'
.
A workaround would be to replace the string 'N/A'
by the value NULL
:
MAX(case when NUM_ID = 4 then TEXT else NULL end) as '4'