Unwanted characters in the HTML beautified text

My original web-scraped HTML text looks like this:

> {"overview":"\\u003cp\\u003e\\u003cspan style=\\"font-size:
> 10.5pt;\\"\\u003e\\u003cspan class=\\"TextRun SCXW87260372 BCX0\\" style=\\"margin: 0px; padding: 0px; -webkit-user-drag: none;
> -webkit-tap-highlight-color: transparent; color: #000000; font-family: \'Meiryo UI\', \'Meiryo UI_MSFontService\', sans-serif; font-kerning:
> none; line-height: 15.1083px; font-variant-ligatures: none
> !important;\\"\\u003e\\u003cspan class=\\"NormalTextRun SCXW87260372
> BCX0\\" style=\\"margin: 0px; padding: 0px; -webkit-user-drag: none;
> -webkit-tap-highlight-color: transparent; background-color: inherit;\\"\\u003eFioriアプリの動作確認で、２通りのトラブルシューティングをする\\u003c/span\\u003e\\u003c/span\\u003e\\u003cspan
> class=\\"EOP SCXW87260372 BCX0\\" style=\\"margin: 0px;.....

I used BeautifulSoup to strip all the HTML tags with the code below:

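import os

from bs4 import BeautifulSoup


def beautify_full_text(content):
    try:
        # Decode the literal \uXXXX escapes, then parse the result as HTML
        soup = BeautifulSoup(content.encode('utf-8').decode('unicode-escape'), "html.parser")
        # Strip presentational attributes from every tag
        for tag in soup():
            for attribute in ["class", "id", "name", "style"]:
                del tag[attribute]

        # Keep only the non-empty lines of the extracted text
        return os.linesep.join([s for s in soup.text.splitlines() if s])
    except Exception as e:
        print(e)
        return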

The returned text no longer contains HTML tags, but it now looks like this:

{"overview":"Fioriã\x82¢ã\x83\x97ã\x83ªã\x81®å\x8b\x95ä½\x9cç¢ºèª\x8dã\x81§ã\x80\x81ï¼\x92é\x80\x9aã\x82\x8aã\x81®ã\x83\x88ã\x83©ã\x83\x96ã\x83«ã\x82·ã\x83¥ã\x83¼ã\x83\x86ã\x82£ã\x83³ã\x82°ã\x82\x92ã\x81\x99ã\x82\x8bÂ\xa0\nGatewayã\x81®ã\x82¨ã\x83©ã\x83¼ã\x83\xadã\x82°ã\x82\x92ç¢ºèª\x8dÂ\xa0\nã\x83\x96ã\x83©ã\x82¦ã\x82¶ã\x81®ã\x82³ã\x83³ã\x82½ã\x83¼ã\x83«ã\x81§ICFã\x82µã\x83¼ã\x83\x93ã\x82¹ç\xad\x89ã\x81§403/403ã\x81\x8cå\x87ºã\x81¦ã\x81\x84ã\x81ªã\x81\x84ã\x81\x8b\nÂ\xa0â\x80¯Â\xa0[Gateway
Foundation] Which Tools Can Be Used for
Troubleshooting?Â\xa0\næ¥µå\x8a\x9bã\x83\xadã\x82°ã\x82ªã\x83³è¨\x80èª\x9eï¼\x9dè\x8b±èª\x9eã\x81«ã\x81\x97ã\x81¦ã\x80\x81ã\x80\x8cggrksã\x80\x8dã\x82\x92ã\x82ªã\x83\x96ã\x83©ã\x83¼ã\x83\x88ã\x81«å\x8c\nã\x82\x93ã\x81§è¨\x80ã\x81\x86Â\xa0\n"}

Is there a way I can eliminate these unwanted characters as well?

CodePudding user response：

The problem with the unicode-escape codec is that it decodes the escape sequences but also decodes every other byte as latin1. Since your stream contains non-latin1 characters (the Japanese text, encoded as UTF-8), the result is mojibake. To undo the incorrect decoding, re-encode as latin1 and decode as utf8 again:

s='''\
{"overview":"\\u003cp\\u003e\\u003cspan style=\\"font-size:
10.5pt;\\"\\u003e\\u003cspan class=\\"TextRun SCXW87260372 BCX0\\" style=\\"margin: 0px; padding: 0px; -webkit-user-drag: none;
-webkit-tap-highlight-color: transparent; color: #000000; font-family: \'Meiryo UI\', \'Meiryo UI_MSFontService\', sans-serif; font-kerning:
none; line-height: 15.1083px; font-variant-ligatures: none
!important;\\"\\u003e\\u003cspan class=\\"NormalTextRun SCXW87260372
BCX0\\" style=\\"margin: 0px; padding: 0px; -webkit-user-drag: none;
-webkit-tap-highlight-color: transparent; background-color: inherit;\\"\\u003eFioriアプリの動作確認で、２通りのトラブルシューティングをする\\u003c/span\\u003e\\u003c/span\\u003e\\u003cspan
class=\\"EOP SCXW87260372 BCX0\\" style=\\"margin: 0px;'''

print(s.encode('utf8').decode('unicode-escape').encode('latin1').decode('utf8'))

Output:

{"overview":"<p><span style="font-size:
10.5pt;"><span class="TextRun SCXW87260372 BCX0" style="margin: 0px; padding: 0px; -webkit-user-drag: none;
-webkit-tap-highlight-color: transparent; color: #000000; font-family: 'Meiryo UI', 'Meiryo UI_MSFontService', sans-serif; font-kerning:
none; line-height: 15.1083px; font-variant-ligatures: none
!important;"><span class="NormalTextRun SCXW87260372
BCX0" style="margin: 0px; padding: 0px; -webkit-user-drag: none;
-webkit-tap-highlight-color: transparent; background-color: inherit;">Fioriアプリの動作確認で、２通りのトラブルシューティングをする</span></span><span
class="EOP SCXW87260372 BCX0" style="margin: 0px;
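
To see why the latin1 round trip works, here is a minimal sketch on a tiny string. The sample text and the variable names (text, garbled, fixed) are made up for illustration. Note that the round trip assumes the escapes only decode to code points up to U+00FF; an escape such as \u3042 would make the encode('latin-1') step raise UnicodeEncodeError.

```python
# Hypothetical sample: literal \u003c / \u003e escapes plus one Japanese character.
text = "\\u003cp\\u003eア"

# unicode-escape resolves the \uXXXX escapes, but it reads the three UTF-8 bytes
# of "ア" (E3 82 A2) as three separate latin-1 characters.
garbled = text.encode("utf-8").decode("unicode-escape")
print(repr(garbled))

# Encoding as latin-1 turns those characters back into the original bytes,
# which can then be decoded correctly as UTF-8.
fixed = garbled.encode("latin-1").decode("utf-8")
print(fixed)
```

The garbled value contains the same "ã\x82¢" sequence that appears right after "Fiori" in the question's output.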

Now that it is decoded, it looks like it was originally a JSON response. If you used the requests module to retrieve the data, try response.json() to see whether it decodes correctly, or call json.loads() on your scraped string.
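
A minimal sketch of the json.loads() route, assuming the scraped string is valid JSON whose escapes use a single backslash (\u003c, \"). The payload below is a shortened, hypothetical stand-in for the real response, and raw, data and html are illustrative names.

```python
import json

# Hypothetical, shortened payload: HTML-sensitive characters are JSON-escaped,
# while the Japanese text is stored as-is.
raw = r'{"overview":"\u003cp\u003e\u003cspan style=\"font-size: 10.5pt;\"\u003eFioriアプリの動作確認\u003c/span\u003e\u003c/p\u003e"}'

# json.loads resolves the \uXXXX and \" escapes without touching the
# non-ASCII characters, so no latin-1 round trip is needed.
data = json.loads(raw)
html = data["overview"]
print(html)
```

The resulting html string can then be passed to BeautifulSoup directly, without the unicode-escape step.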

CodePudding user response：

It turns out that a small tweak solved the problem. The code now looks like this:

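import os

from bs4 import BeautifulSoup


def beautify_full_text(content):
    try:
        # Decode the literal \uXXXX escapes, then parse the result as HTML
        soup = BeautifulSoup(content.encode('utf-8').decode('unicode-escape'), "html.parser")
        # Strip presentational attributes from every tag
        for tag in soup():
            for attribute in ["class", "id", "name", "style"]:
                del tag[attribute]

        # Keep only the non-empty lines of the extracted text
        beau_text = os.linesep.join([s for s in soup.text.splitlines() if s])
        # Drop every non-ASCII character (this also removes the Japanese text)
        beau_text = beau_text.encode("ascii", "ignore").decode()
        return beau_text
    except Exception as e:
        print(e)
        return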