I have been learning how UTF-16 endianness is represented in C and Python. From "How is unicode represented internally in Python?":
\u2049 is then represented as either \x49\x20 or \x20\x49 or \x49\x20\x00\x00 or \x00\x00\x20\x49 depending on the native byte order of your system and if UCS2 or UCS4 was picked.
ord('⁉') # gives 8265 in decimal
According to https://unicode-table.com/en/2049/, 8265 is the UTF-16BE value in decimal. I expected it to be 18720 in decimal, which is the UTF-16LE value.
CodePudding user response:
There is a confusion here between the byte representation of a number, the byte representation of a Unicode character, and the value of a Unicode code point. The code point value is a plain integer and does not depend on endianness. 8265, or 0x2049, is the code point of the Unicode character U+2049 EXCLAMATION QUESTION MARK (⁉). Full stop.
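To make this concrete, here is a minimal sketch (the variable names are only illustrative) showing that the code point is a plain integer, with no byte order involved:

```python
# The character U+2049, written with an escape so the source file encoding does not matter.
ch = "\u2049"

# ord() returns the code point as an ordinary integer.
code_point = ord(ch)
print(code_point)       # 8265
print(hex(code_point))  # 0x2049

# chr() goes the other way: integer code point -> one-character string.
round_trip = chr(0x2049)
print(round_trip == ch)  # True
```

The same value is printed on a little-endian and on a big-endian machine, because no encoding to bytes takes place here.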
That character itself is represented as the byte string b'\x20\x49' in UTF-16BE encoding and b'\x49\x20' in UTF-16LE encoding.
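The two encodings can be checked directly with str.encode. The sketch below also shows where the 18720 from the question comes from: it is what you get if you take the UTF-16LE bytes and read them back as a big-endian integer. The variable names are only illustrative.

```python
ch = "\u2049"

# Explicit byte order, no BOM is written for these two codecs.
be = ch.encode("utf-16-be")
le = ch.encode("utf-16-le")
print(be.hex())  # 2049
print(le.hex())  # 4920

# Reading each byte string with its own byte order gives back the code point.
from_be = int.from_bytes(be, "big")
from_le = int.from_bytes(le, "little")
print(from_be, from_le)  # 8265 8265

# Reading the little-endian bytes as if they were big-endian gives 0x4920.
misread = int.from_bytes(le, "big")
print(misread)  # 18720
```

So 18720 is not a different value of the character; it is the same two bytes interpreted with the wrong byte order.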
Its internal byte representation is indeed \x49\x20 or \x20\x49 or \x49\x20\x00\x00 or \x00\x00\x20\x49 depending on the native byte order of your system and if UCS2 or UCS4 was picked.
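The four-byte forms have the same layout as the UTF-32 encodings, which can be reproduced as a rough sketch. Note the assumption: this only produces encoded byte strings with the same layout; it does not inspect the interpreter's internal storage, which in CPython 3.3 and later is chosen per string (PEP 393) rather than fixed as UCS2 or UCS4 at build time.

```python
import sys

ch = "\u2049"

# Native byte order of the machine running this script: "little" or "big".
print(sys.byteorder)

# Four bytes per code point, explicit byte order, no BOM.
wide_le = ch.encode("utf-32-le")
wide_be = ch.encode("utf-32-be")
print(wide_le.hex())  # 49200000
print(wide_be.hex())  # 00002049
```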
But you really should not care about it. You have no way to access it from the Python language, so it only matters if you write a C module and do not want to use the Unicode library functions. Put differently, unless you want to build a Python interpreter in C, the internal representation of a Unicode character should be treated as an implementation detail.