How can I clean HTML code data in a DataFrame?

I called an API and loaded the response into a DataFrame.
One column has many rows whose values are HTML code. I would like to strip the HTML markup and keep only the text itself. How can I do that?

Example:

<p><span style="color: red;">家庭旅遊保險計劃</span><span style="font-size: 11.5pt;color: red;">包</span><span style="font-size: 11.5pt;color: red;">括</span><span style="color: red;">申請人及配偶，及<b>免費</b>最多其四名</span><span style="color: red;">18</span><span style="color: red;">歲以下之同行子女</span><span style="color: red;"></span></p><p>每人每次最高賠償額* (港元)</p><p><b>金計劃</b><span style="font-size: 11.5pt;color: black;"><br/> </span>醫療費用 $1,000,000<span style="font-size: 11.5pt;color: black;"><br/> </span>個人意外 $1,000,000<span style="font-size: 11.5pt;color: black;"><br/> </span>手機意外損毀或遺失 $3,000</p><p><b>銅計劃</b><span style="font-size: 11.5pt;color: black;"><br/> </span>醫療費用 $250,000<span style="font-size: 11.5pt;color: black;"><br/> </span>個人意外 $250,000</p><p>*18歲以下及75歲以上之受保人於計劃內的保障額將會減少<br/></p>

CodePudding user response:

First, install Beautiful Soup by running this in a terminal:
pip install beautifulsoup4

Then apply a function that extracts the text to the HTML column of your pandas DataFrame (see below).

Code:

from bs4 import BeautifulSoup
import pandas as pd

# Create a sample dataframe
html = '<p><span style="color: red;">家庭旅遊保險計劃</span><span style="font-size: 11.5pt;color: red;">包</span><span style="font-size: 11.5pt;color: red;">括</span><span style="color: red;">申請人及配偶，及<b>免費</b>最多其四名</span><span style="color: red;">18</span><span style="color: red;">歲以下之同行子女</span><span style="color: red;"></span></p><p>每人每次最高賠償額* (港元)</p><p><b>金計劃</b><span style="font-size: 11.5pt;color: black;"><br/> </span>醫療費用 $1,000,000<span style="font-size: 11.5pt;color: black;"><br/> </span>個人意外 $1,000,000<span style="font-size: 11.5pt;color: black;"><br/> </span>手機意外損毀或遺失 $3,000</p><p><b>銅計劃</b><span style="font-size: 11.5pt;color: black;"><br/> </span>醫療費用 $250,000<span style="font-size: 11.5pt;color: black;"><br/> </span>個人意外 $250,000</p><p>*18歲以下及75歲以上之受保人於計劃內的保障額將會減少<br/></p>'
df = pd.DataFrame([{'html': html}])

# Extract the text from each HTML string; naming the parser avoids bs4's GuessedAtParserWarning
df['extracted'] = df['html'].apply(lambda s: BeautifulSoup(s, 'html.parser').text)

Output:

	html	extracted
0	<p><span style="color: red;">家庭旅遊保險計劃</span><span style="font-size: 11.5pt;color: red;">包</span><span style="font-size: 11.5pt;color: red;">括</span><span style="color: red;">申請人及配偶，及<b>免費</b>最多其四名</span><span style="color: red;">18</span><span style="color: red;">歲以下之同行子女</span><span style="color: red;"></span></p><p>每人每次最高賠償額* (港元)</p><p><b>金計劃</b><span style="font-size: 11.5pt;color: black;"><br/> </span>醫療費用 $1,000,000<span style="font-size: 11.5pt;color: black;"><br/> </span>個人意外 $1,000,000<span style="font-size: 11.5pt;color: black;"><br/> </span>手機意外損毀或遺失 $3,000</p><p><b>銅計劃</b><span style="font-size: 11.5pt;color: black;"><br/> </span>醫療費用 $250,000<span style="font-size: 11.5pt;color: black;"><br/> </span>個人意外 $250,000</p><p>*18歲以下及75歲以上之受保人於計劃內的保障額將會減少<br/></p>	家庭旅遊保險計劃包括申請人及配偶，及免費最多其四名18歲以下之同行子女每人每次最高賠償額* (港元)金計劃醫療費用 $1,000,000 個人意外 $1,000,000 手機意外損毀或遺失 $3,000銅計劃醫療費用 $250,000 個人意外 $250,000*18歲以下及75歲以上之受保人於計劃內的保障額將會減少
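If the column can contain missing values, or you want the text pieces kept apart instead of run together, a slightly more defensive version might look like the sketch below. The helper name html_to_text and the sample English markup are made up for illustration; get_text with separator and strip is Beautiful Soup's own method.

```python
from bs4 import BeautifulSoup
import pandas as pd


def html_to_text(value, separator=" "):
    """Return the visible text of an HTML string (hypothetical helper)."""
    if not isinstance(value, str):
        return value  # leave None/NaN cells untouched
    soup = BeautifulSoup(value, "html.parser")
    # separator is placed between text nodes; strip=True trims each node
    # and skips nodes that are empty after trimming
    return soup.get_text(separator=separator, strip=True)


# Made-up sample data: one HTML string and one missing value
df = pd.DataFrame({
    "html": [
        "<p><b>Gold plan</b><br/>Medical $1,000,000</p><p>Accident $250,000</p>",
        None,
    ]
})
df["extracted"] = df["html"].apply(html_to_text)
print(df["extracted"].iloc[0])
```

Passing separator="\n" instead should keep each text node on its own line, which may be easier to read for data like the plan tables in the question.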

CodePudding user response:

You can use Beautiful Soup to find an element by its tag name and class, then use .text to get its text and store it in a list.


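# new_soup is a BeautifulSoup object and name is a list; replace the placeholder strings with your own tag and class
title = new_soup.find("HTML tag", class_="add class name of HTML tag")
if title is not None:  # find() returns None when nothing matches
    name.append(title.text)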

The .text attribute returns only the text inside the matched element, with all nested tags removed.
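For reference, here is a small runnable sketch of that idea. The markup, the h2 tag, and the plan-title class are invented for the example, so swap in whatever your own HTML uses. Note that the HTML in the question has no class attributes, so this approach only helps when the text you want sits in an element you can identify.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only
page = (
    '<div><h2 class="plan-title">Gold plan</h2>'
    '<p class="note">Medical $1,000,000</p>'
    '<h2 class="plan-title">Bronze plan</h2></div>'
)
new_soup = BeautifulSoup(page, "html.parser")

name = []
# find() returns only the first matching element, or None if there is none
title = new_soup.find("h2", class_="plan-title")
if title is not None:
    name.append(title.text)

# find_all() returns every matching element
all_names = [tag.get_text(strip=True)
             for tag in new_soup.find_all("h2", class_="plan-title")]

print(name)       # ['Gold plan']
print(all_names)  # ['Gold plan', 'Bronze plan']
```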