unicodedata.name('\x00') raises a ValueError exception:
Python 3.8.10 (default, Sep 28 2021, 16:10:42)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information
>>> import unicodedata
>>> unicodedata.name('\x00')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
\x00 is the NUL character. Why does unicodedata.name('\x00') raise an exception? I get the same error for the other non-printable ASCII characters (\x00 to \x1F, and \x7F). Is unicodedata.name() only for printable characters? If so, where is that mentioned in the Python documentation?
CodePudding user response:
The name of a Unicode character refers to the entries in this list: https://www.unicode.org/Public/13.0.0/ucd/NamesList.txt
As you can see there, all the non-printable ASCII control characters are listed as "<control>"; "NULL" is not the name of U+0000, it's an alias.
Why Python doesn't return "<control>" is another question, and one I can't answer.
CodePudding user response:
As per this Wikipedia article, Cc control characters have no name in Unicode. All the characters you mentioned are in the Cc category (you can confirm this with unicodedata.category):
>>> import unicodedata
>>> unicodedata.category('\x00')
'Cc'
>>> unicodedata.category('\x1F')
'Cc'
>>> unicodedata.category('\x7F')
'Cc'
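To check the whole ASCII range rather than three sample characters, a minimal sketch along these lines could be used. It relies on the optional second argument of unicodedata.name(), which is returned instead of raising ValueError; the variable names unnamed and categories are only illustrative.

```python
import unicodedata

# Collect every ASCII code point for which name() has nothing to return.
# Passing None as the default avoids the ValueError seen in the question.
unnamed = [cp for cp in range(0x80) if unicodedata.name(chr(cp), None) is None]

# Gather the General Category of each of those unnamed code points.
categories = {unicodedata.category(chr(cp)) for cp in unnamed}

print([f'U+{cp:04X}' for cp in unnamed])
print(categories)
```

If the explanation above holds, the unnamed code points should be exactly the ones from the question, and Cc should be the only category in the set.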
In Unicode, "Control-characters" are U+0000–U+001F (C0 controls), U+007F (delete), and U+0080–U+009F (C1 controls). Their General Category is "Cc". Formatting codes are distinct, in General Category "Cf". The Cc control characters have no Name in Unicode, but are given labels such as "<control-001A>" instead.
You can also see that control characters are explicitly handled in the CPython source code.
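If you want something printable for these characters anyway, one option is to build the label yourself. The following is only a sketch: describe is a hypothetical helper, not part of unicodedata, and it assumes the "<control-XXXX>" label format quoted above.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the Unicode name of ch, or a <control-XXXX> label for Cc characters."""
    try:
        return unicodedata.name(ch)
    except ValueError:
        # Only control characters get a synthesized label; any other
        # unnamed character (e.g. an unassigned code point) still raises.
        if unicodedata.category(ch) == 'Cc':
            return f'<control-{ord(ch):04X}>'
        raise

for ch in ('\x00', '\x1a', '\x7f', 'A'):
    print(f'U+{ord(ch):04X}', describe(ch))
```

Re-raising for non-Cc characters keeps the helper from hiding lookups that fail for a different reason.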