Why does unicodedata.name() raise "ValueError: no such name" for some ASCII characters?

Time:11-23

unicodedata.name('\x00') raises a ValueError exception:

Python 3.8.10 (default, Sep 28 2021, 16:10:42) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information
>>> import unicodedata
>>> unicodedata.name('\x00')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name

\x00 is the NUL character. Why does unicodedata.name('\x00') raise an exception? I am getting the same error for other non-printable ASCII characters (\x00 to \x1F, and \x7F). Is unicodedata.name() only for printable characters? If so, where is it mentioned in the Python documentation?

CodePudding user response:

The name of a Unicode character is whatever this list assigns to it: https://www.unicode.org/Public/13.0.0/ucd/NamesList.txt

As you can see there, every non-printable ASCII control character is listed as "<control>". "NULL" is not the name of U+0000; it is an alias.
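To illustrate the name/alias distinction, here is a small sketch. It assumes Python 3.3 or later, where unicodedata.lookup() accepts name aliases; the variable name nul is only for illustration.

```python
import unicodedata

# lookup() resolves name aliases (Python 3.3+), so the alias "NULL" is found.
nul = unicodedata.lookup("NULL")
print(repr(nul))

# name() only returns formal names, and U+0000 has none, so it still raises.
try:
    unicodedata.name(nul)
except ValueError as exc:
    print("ValueError:", exc)
```

So the alias works in one direction only: it can be used to find the character, but name() does not report it.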

As for why Python doesn't return "<control>", that is another question, and one I can't answer.

CodePudding user response:

As per this Wikipedia article, Cc control characters have no name in Unicode. All the characters you mentioned are in the Cc category (you can confirm this with unicodedata.category):

>>> import unicodedata
>>> unicodedata.category('\x00')
'Cc'
>>> unicodedata.category('\x1F')
'Cc'
>>> unicodedata.category('\x7F')
'Cc'

In Unicode, "Control-characters" are U+0000–U+001F (C0 controls), U+007F (delete), and U+0080–U+009F (C1 controls). Their General Category is "Cc". Formatting codes are distinct, in General Category "Cf". The Cc control characters have no Name in Unicode, but are given labels such as "<control-001A>" instead.
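If you need something printable for every character, one possible workaround is to use the optional default argument of unicodedata.name() and build the label yourself. This is only a sketch: name_or_label is a hypothetical helper, not part of the standard library, and the "<unnamed-XXXX>" fallback is made up for this example.

```python
import unicodedata

def name_or_label(ch: str) -> str:
    """Return the Unicode name of ch, or a '<control-XXXX>' label for Cc characters."""
    # Passing a default makes name() return it instead of raising ValueError.
    name = unicodedata.name(ch, None)
    if name is not None:
        return name
    if unicodedata.category(ch) == "Cc":
        # Mirrors the label format quoted above, e.g. <control-001A>.
        return f"<control-{ord(ch):04X}>"
    # Hypothetical placeholder for any other unnamed code point.
    return f"<unnamed-{ord(ch):04X}>"

for ch in ("\x00", "\x1a", "\x7f", "A"):
    print(repr(ch), name_or_label(ch))
```

The Cc check relies only on unicodedata.category, so it also covers the C1 controls U+0080 to U+009F.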

You can also see that control characters are handled explicitly in the CPython source code.
