Why does ord give a big-endian result when my platform is little-endian?

Time:02-11

I have been learning how UTF-16 endianness is represented in C and Python. From "How is unicode represented internally in Python?":

u2049 is then represented as either \x49\x20 or \x20\x49 or \x49\x20\x00\x00 or \x00\x00\x20\x49 depending on the native byte order of your system and if UCS2 or UCS4 was picked.

ord('⁉') # gives 8265 in decimal 

Check on https://unicode-table.com/en/2049/ :

8265 is the UTF-16 BE value in decimal. I expected 18720 in decimal, which is the UTF-16 LE value.

Answer:

You are confusing three things: the byte representation of a number, the byte representation of a Unicode character, and the value of a Unicode code point. The code point is a plain integer and does not depend on endianness. 8265, or 0x2049, is the code point of the Unicode character U+2049 EXCLAMATION QUESTION MARK (⁉), and that is what ord returns. Full stop.
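As a minimal sketch of that point (the variable names `ch` and `code_point` are only illustrative), the following should behave the same on a little-endian and a big-endian machine, because ord returns an integer value, not bytes:

```python
import sys

ch = '\u2049'          # EXCLAMATION QUESTION MARK
code_point = ord(ch)   # the code point, an ordinary Python int

print(sys.byteorder)       # 'little' or 'big', depending on the machine
print(code_point)          # 8265
print(hex(code_point))     # 0x2049
print(chr(0x2049) == ch)   # True: chr is the inverse of ord
```

The value of `sys.byteorder` varies by platform, but none of the other lines depend on it.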

That character itself is represented as the byte string b'\x20\x49' in UTF-16BE encoding and b'\x49\x20' in UTF-16LE encoding.
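A short sketch of where the two numbers in the question come from, using only str.encode and int.from_bytes (the names `be`, `le` and `misread` are illustrative). Byte order only appears once the character is encoded to bytes, and 18720 is what you get when the UTF-16LE bytes are read as if they were a big-endian integer:

```python
ch = '\u2049'

be = ch.encode('utf-16-be')   # explicit big-endian, no BOM
le = ch.encode('utf-16-le')   # explicit little-endian, no BOM

print(be.hex())   # 2049
print(le.hex())   # 4920

# Decoding each byte string with its own byte order recovers the code point.
print(int.from_bytes(be, 'big'))      # 8265
print(int.from_bytes(le, 'little'))   # 8265

# Reading the little-endian bytes with the wrong byte order gives 0x4920.
misread = int.from_bytes(le, 'big')
print(misread)                        # 18720
```

So 18720 is not a property of the character. It is the code unit 0x2049 with its two bytes swapped.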

Its internal byte representation is indeed

\x49\x20 or \x20\x49 or \x49\x20\x00\x00 or \x00\x00\x20\x49 depending on the native byte order of your system and if UCS2 or UCS4 was picked.

But you really should not care about it. You have no way to access it from the Python language, so it only matters if you write a C extension module and do not want to use the Unicode library functions. Put differently, unless you want to build a Python interpreter in C, the internal representation of a Unicode character should be treated as an implementation detail.
