Some annoying characters are not normalised by unicodedata


I have a Python string that looks as shown below. It comes from the SEC filing of a US public company. I am trying to remove some annoying characters from the string with the unicodedata.normalize function, but it does not remove all of them. What could be the reason for this behavior?

from unicodedata import normalize
s = '[email protected]\nFacsimile\nNo.:\xa0 312-233-2266\n\xa0\nJPMorgan Chase Bank,\nN.A., as Administrative Agent\n10 South Dearborn, Floor 7th\nIL1-0010\nChicago, IL 60603-2003\nAttention:\xa0 Hiral Patel\nFacsimile No.:\xa0 312-385-7096\n\xa0\nLadies and Gentlemen:\n\xa0\nReference is made to the\nCredit Agreement, dated as of May\xa07, 2010 (as the same may be amended,\nrestated, supplemented or otherwise modified from time to time, the \x93Credit Agreement\x94), by and among\nHawaiian Electric Industries,\xa0Inc., a Hawaii corporation (the \x93Borrower\x94), the Lenders from time to\ntime party thereto and JPMorgan Chase Bank, N.A., as issuing bank and\nadministrative agent (the \x93Administrative Agent\x94).'

normalize('NFKC', s)
'[email protected]\nFacsimile\nNo.:  312-233-2266\n \nJPMorgan Chase Bank,\nN.A., as Administrative Agent\n10 South Dearborn, Floor 7th\nIL1-0010\nChicago, IL 60603-2003\nAttention:  Hiral Patel\nFacsimile No.:  312-385-7096\n \nLadies and Gentlemen:\n \nReference is made to the\nCredit Agreement, dated as of May 7, 2010 (as the same may be amended,\nrestated, supplemented or otherwise modified from time to time, the \x93Credit Agreement\x94), by and among\nHawaiian Electric Industries, Inc., a Hawaii corporation (the \x93Borrower\x94), the Lenders from time to\ntime party thereto and JPMorgan Chase Bank, N.A., as issuing bank and\nadministrative agent (the \x93Administrative Agent\x94).'

As one can see from the output, the \xa0 characters are handled properly, but characters like \x92, \x93 and \x94 are not normalized and remain as-is in the result string.

CodePudding user response:

Your data was decoded as ISO-8859-1 (aka latin1), but those Unicode code points are control characters in that encoding. In Windows-1252 (aka cp1252) they are so-called smart quotes:

>>> '\x92\x93\x94'.encode('latin1').decode('cp1252')
'’“”'

They also don't change when normalized, but at least they display correctly if decoded properly:

>>> import unicodedata as ud
>>> ud.normalize('NFKC','\x92\x93\x94'.encode('latin1').decode('cp1252'))
'’“”'
>>> print(s.encode('latin1').decode('cp1252'))
[email protected]
Facsimile
No.:  312-233-2266
 
JPMorgan Chase Bank,
N.A., as Administrative Agent
10 South Dearborn, Floor 7th
IL1-0010
Chicago, IL 60603-2003
Attention:  Hiral Patel
Facsimile No.:  312-385-7096
 
Ladies and Gentlemen:
 
Reference is made to the
Credit Agreement, dated as of May 7, 2010 (as the same may be amended,
restated, supplemented or otherwise modified from time to time, the “Credit Agreement”), by and among
Hawaiian Electric Industries, Inc., a Hawaii corporation (the “Borrower”), the Lenders from time to
time party thereto and JPMorgan Chase Bank, N.A., as issuing bank and
administrative agent (the “Administrative Agent”).
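
If you need to apply this repair to many documents, a small helper can do the round-trip defensively. This is just a sketch of my own (the fix_cp1252 name is not from the question); it falls back to the original text when the round-trip is not possible:

def fix_cp1252(text: str) -> str:
    # Re-interpret latin1-decoded text as cp1252. Some bytes (e.g. \x81,
    # \x8d, \x8f, \x90, \x9d) are undefined in cp1252, and characters above
    # U+00FF cannot be encoded as latin1, so fall back on any Unicode error.
    try:
        return text.encode('latin1').decode('cp1252')
    except UnicodeError:
        return text

>>> fix_cp1252('\x93Credit Agreement\x94')
'“Credit Agreement”'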

Note that \xa0 is the code point U+00A0 (NO-BREAK SPACE), whose compatibility decomposition is a plain SPACE, so the compatibility forms (NFKC/NFKD) normalize it to one:

>>> ud.name('\xa0')
'NO-BREAK SPACE'
>>> ud.normalize('NFKC','\xa0')
' '
>>> ud.name(ud.normalize('NFKC','\xa0'))
'SPACE'

It prints correctly without normalization:

>>> print('hello\xa0there')
hello there

CodePudding user response:

unicodedata.normalize is not meant to "remove [...] characters". It exists so that Unicode strings that are equivalent but written with different representations can be converted to a uniform representation; it will not mutilate the text by dropping characters that "don't look good". What happens with \xa0 (NO-BREAK SPACE) in particular is that the compatibility forms treat it as equivalent to a plain space (\x20), so it is replaced by one.
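
For example, the composed and decomposed spellings of an accented letter compare unequal until they are normalized to the same form (a quick illustration of what normalize is actually for, not taken from the original question):

>>> from unicodedata import normalize
>>> '\u00e9' == 'e\u0301'  # 'é' as one code point vs. 'e' + combining accent
False
>>> normalize('NFC', '\u00e9') == normalize('NFC', 'e\u0301')
True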

That said, it looks like the application that generated the data you are consuming included these characters with some semantic purpose; their meanings are listed at C0 and C1 control codes - Wikipedia. If you just want to discard that information while preserving the other non-ASCII characters in your text, removing every character in the C1 block range after normalizing will do the job. re.sub is convenient here because it lets you select a character range:

import re
...
s1 = normalize("NFKC", s)
s2 = re.sub("[\x80-\x9f]", "", s1)  # the C1 block is U+0080 through U+009F
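
If you prefer to avoid re, str.translate can delete the same range; this is an equivalent sketch under the same assumption that only the C1 block (U+0080-U+009F) should go:

# Map every C1 code point to None, which str.translate treats as deletion
c1_table = dict.fromkeys(range(0x80, 0xa0), None)
s2 = s1.translate(c1_table)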

If you want to simply drop all non-ASCII characters instead (not recommended unless the only non-ASCII content in your source is these control characters), encode the text to ASCII with "ignore" as the error policy and decode it back:

s2 = s1.encode("ascii", errors="ignore").decode()
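
Be aware that this drops every non-ASCII character, accented letters included, which is why the caveat above matters:

>>> '“smart” quotes and café'.encode('ascii', errors='ignore').decode()
'smart quotes and caf'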