Normalize string from webpage

Time:10-04

I'm trying to normalize the string "PartII\xa0I \x96 FINANCIAL\n INFORMATION". In general, all that should be left (once non-UTF-8 characters are excluded) are letters, numbers and dots, so the expected output is "PartII FINANCIAL INFORMATION". The text comes from this SEC form.

Solutions tried, where text is the string:

  1. text.encode('utf-8', errors='ignore').decode('utf-8')
  2. unicodedata.normalize(decoding, text)

CodePudding user response:

Use this; it should work for you:

text.encode('ascii', errors='ignore').decode('utf-8')
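
To see why this behaves differently from the UTF-8 attempt in the question, here is a small sketch using the sample string exactly as quoted there. The variable names are only for illustration. Both \xa0 and \x96 are valid Unicode code points, so UTF-8 can encode them and errors='ignore' has nothing to drop; ASCII covers only code points 0 to 127, so both get removed.

```python
text = "PartII\xa0I \x96 FINANCIAL\n INFORMATION"

# UTF-8 can represent every character in the string, so nothing is
# ignored and the round trip returns the input unchanged.
utf8_round_trip = text.encode('utf-8', errors='ignore').decode('utf-8')

# ASCII cannot represent \xa0 or \x96, so errors='ignore' drops them.
ascii_only = text.encode('ascii', errors='ignore').decode('ascii')

print(repr(utf8_round_trip))
print(repr(ascii_only))
```

Note that on the string as quoted, dropping \xa0 joins "PartII" and "I" into "PartIII" and leaves a double space before "FINANCIAL", so the result is not identical to the expected output stated in the question.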

Also, if you need to remove \n, use this:

text.replace('\n', "").encode('ascii', errors='ignore').decode('utf-8')
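
If you also want the non-breaking space to act as a word separator and runs of whitespace collapsed, a possible variant is sketched below. The helper name normalize_text is hypothetical, and the NFKD step plus the regex are my own additions, not part of the one-liners above. NFKD turns \xa0 into a regular space before the ASCII filter runs, so the words on either side are not glued together.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Hypothetical helper: ASCII-only text with single spaces."""
    # NFKD replaces compatibility characters with plain equivalents,
    # e.g. the non-breaking space U+00A0 becomes a regular space.
    decomposed = unicodedata.normalize('NFKD', text)
    # Drop anything that still falls outside ASCII (here: \x96).
    ascii_only = decomposed.encode('ascii', errors='ignore').decode('ascii')
    # Collapse spaces, tabs and newlines into single spaces.
    return re.sub(r'\s+', ' ', ascii_only).strip()


print(normalize_text("PartII\xa0I \x96 FINANCIAL\n INFORMATION"))
```

On the sample string this keeps the standalone "I" as its own word, so if that character should go too, it would need an extra rule specific to your data.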