Extract all text from HTML file while checking for boldness (Python)

Input: Any HTML file that contains bold and non-bold text, distributed across different types of tags (e.g. <div>, <td>, etc.)

Desired Output: A data structure (e.g. a data frame or dictionary) that collects all the text elements of the HTML file, along with whether each text element was bold or not. For example:

import pandas as pd
data = {'Text': ['bold text (1)', 'text (2)', 'text (3)', 'bold text (4)'], 'Bold': ['yes', 'no', 'no', 'yes']}
df = pd.DataFrame(data)

Notes: To my knowledge, bold text is either enclosed in <b> ... </b> tags, or sits in an arbitrary tag that carries the attribute style="font-weight:700;" or style="font-weight:bold;", for example a <span> as in the sample below.

Reproducible Example: This is my sample HTML file with 15 text elements, 4 of which are bold:

<html><head><title>text (1)</title></head><body><div>text (2)</div><div>text (3)</div><div><span>text (4)</span></div><div>text (5)</div><div><span>text (6)</span><span>text (7)</span></div><div><span style="font-weight:bold;">bold text (8)</span></div><div><span>text (9)</span></div><div><span style="font-weight:700;">bold text (10)</span></div><div><span>text (11)</span></div><div><span><b>bold text (12)</b></span></div><div><span>text (13)</span><span><a href="www.google.de"><b>bold text (14)</b></a></span></div><div><span>text (15)</span></div></body></html>

I figured out how to get all the text elements with Beautiful Soup...

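from bs4 import BeautifulSoup

html_file = 'sample.html'  # placeholder path to the sample file above
with open(html_file, 'r') as f:
    # create soup object of .html file
    soup = BeautifulSoup(f, 'html.parser')

# find_all is the current name for findAll; string=True matches every text node
print(soup.find_all(string=True))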

# output: ['text (1)', 'text (2)', 'text (3)', 'text (4)', 'text (5)', 'text (6)', 'text (7)', 'bold text (8)', 'text (9)', 'bold text (10)', 'text (11)', 'bold text (12)', 'text (13)', 'bold text (14)', 'text (15)']

...but I cannot figure out how to get the information about the tag attributes (font-weight), nor how to check whether the enclosing tag is <b> or not. Can you please give me a hint?

CodePudding user response：

To get a step closer, you could check whether the name of the text's parent is b, or whether the parent has a style attribute that contains font-weight:

for e in soup.find_all(string=True):
    data.append({
        'text': e,
        'isBoldTag': e.parent.name == 'b',
        'isBoldStyle': 'font-weight' in (e.parent.get('style') or '')
    })
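
Note that the style check above is true for any font-weight declaration, including font-weight:normal. A stricter check could look at the declared value instead. The following is only a minimal sketch, not a full CSS parser: the helper name style_is_bold is hypothetical, and treating 'bolder' and numeric weights of 700 and above as bold is an assumption that goes beyond the two values named in the question.

```python
import re

# Matches a font-weight declaration inside an inline style string,
# e.g. "color:red; font-weight: 700;". Only the first match is used.
_FONT_WEIGHT = re.compile(r"(?:^|;)\s*font-weight\s*:\s*([^;!]+)", re.IGNORECASE)


def style_is_bold(style):
    """Return True if an inline style string declares a bold font weight.

    Assumption: 'bold', 'bolder' and numeric weights >= 700 count as bold.
    """
    if not style:
        return False
    match = _FONT_WEIGHT.search(style)
    if match is None:
        return False
    value = match.group(1).strip().lower()
    if value in ("bold", "bolder"):
        return True
    # Numeric weights such as 400 (normal) or 700 (bold).
    return value.isdigit() and int(value) >= 700


print(style_is_bold("font-weight:bold;"))    # True
print(style_is_bold("font-weight:700;"))     # True
print(style_is_bold("font-weight:normal;"))  # False
print(style_is_bold(None))                   # False
```

In the loop above, this could stand in for the 'font-weight' in ... test by passing e.parent.get('style') to the helper.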

Example

from bs4 import BeautifulSoup

html='''<html><head><title>text (1)</title></head><body><div>text (2)</div><div>text (3)</div><div><span>text (4)</span></div><div>text (5)</div><div><span>text (6)</span><span>text (7)</span></div><div><span style="font-weight:bold;">bold text (8)</span></div><div><span>text (9)</span></div><div><span style="font-weight:700;">bold text (10)</span></div><div><span>text (11)</span></div><div><span><b>bold text (12)</b></span></div><div><span>text (13)</span><span><a href="www.google.de"><b>bold text (14)</b></a></span></div><div><span>text (15)</span></div></body></html>'''

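# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')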

data = []

for e in soup.find_all(string=True):
    data.append({
        'text': e,
        'isBoldTag': e.parent.name == 'b',
        'isBoldStyle': 'font-weight' in (e.parent.get('style') or '')
    })

data

Output

[{'text': 'text (1)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (2)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (3)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (4)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (5)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (6)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (7)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'bold text (8)', 'isBoldTag': False, 'isBoldStyle': True}, {'text': 'text (9)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'bold text (10)', 'isBoldTag': False, 'isBoldStyle': True}, {'text': 'text (11)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'bold text (12)', 'isBoldTag': True, 'isBoldStyle': False}, {'text': 'text (13)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'bold text (14)', 'isBoldTag': True, 'isBoldStyle': False}, {'text': 'text (15)', 'isBoldTag': False, 'isBoldStyle': False}]

Or as a DataFrame, via pd.DataFrame(data) (with pandas imported as pd):

	text	isBoldTag	isBoldStyle
0	text (1)	False	False
1	text (2)	False	False
2	text (3)	False	False
3	text (4)	False	False
4	text (5)	False	False
5	text (6)	False	False
6	text (7)	False	False
7	bold text (8)	False	True
8	text (9)	False	False
9	bold text (10)	False	True
10	text (11)	False	False
11	bold text (12)	True	False
12	text (13)	False	False
13	bold text (14)	True	False
14	text (15)	False	False
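
The approach above only inspects the direct parent of each text node, so text nested deeper inside a bold element, for example a link inside a <b> tag, would be reported as not bold. One possible extension is to walk up all ancestors and collapse the result into the Text/Bold structure from the question. This is a sketch under stated assumptions rather than a complete solution: the helper name is_bold, the tag set, the list of font-weight values and the last two elements of the sample string are all made up for illustration, and stylesheet rules (classes, <style> blocks) are not considered at all.

```python
from bs4 import BeautifulSoup

# Assumption: these tags and font-weight values are treated as bold.
BOLD_TAGS = {"b", "strong"}
BOLD_WEIGHTS = {"bold", "bolder", "700", "800", "900"}


def is_bold(text_node):
    """Return True if the nearest relevant ancestor makes the text bold."""
    for tag in text_node.parents:
        if tag.name in BOLD_TAGS:
            return True
        # Split the inline style into "property: value" declarations.
        for declaration in (tag.get("style") or "").split(";"):
            prop, _, value = declaration.partition(":")
            if prop.strip().lower() == "font-weight":
                # The nearest explicit font-weight decides the result.
                return value.strip().lower() in BOLD_WEIGHTS
    return False


# The first two spans come from the sample; the last two elements are
# hypothetical additions that exercise nesting and font-weight:normal.
html = (
    '<div><span>text (13)</span>'
    '<span><a href="www.google.de"><b>bold text (14)</b></a></span>'
    '<b><a href="#">nested bold text</a></b>'
    '<span style="font-weight:normal;">normal text</span></div>'
)
soup = BeautifulSoup(html, "html.parser")

data = {"Text": [], "Bold": []}
for node in soup.find_all(string=True):
    data["Text"].append(str(node))
    data["Bold"].append("yes" if is_bold(node) else "no")

print(data["Bold"])  # ['no', 'yes', 'yes', 'no']
```

The resulting dictionary has the same shape as the one in the question, so it can be passed to pd.DataFrame unchanged.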