Input: Any HTML file that contains bold and non-bold text, distributed across different types of tags (e.g. <div>, <span>, <p>, <i>, <td>, etc.).
Desired Output: A data structure (e.g. a data frame or dictionary) that collects all the text elements of the HTML file, along with whether each text element was bold or not. For example:
import pandas as pd
data = {'Text': ['bold text (1)', 'text (2)', 'text (3)', 'bold text (4)'], 'Bold': ['yes', 'no', 'no', 'yes']}
df = pd.DataFrame(data)
Notes:
To my knowledge, bold text is either enclosed in <b>...</b> tags, or sits inside an arbitrary tag that has the attribute style="font-weight:700;" or style="font-weight:bold;", for example <span style="font-weight:700;">...</span>.
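The two style variants mentioned above could be checked with a small helper. This is only a sketch: is_bold_style and its threshold parameter are hypothetical names (not part of Beautiful Soup), it treats numeric weights of 700 and above as bold by assumption, and it only looks at an inline style string, not at stylesheets or classes.

```python
def is_bold_style(style, threshold=700):
    """Return True if an inline style string declares a bold font-weight.

    `threshold` is an assumption: numeric weights >= 700 count as bold.
    """
    bold = False
    for declaration in (style or '').split(';'):
        name, sep, value = declaration.partition(':')
        if not sep or name.strip().lower() != 'font-weight':
            continue
        value = value.strip().lower().replace('!important', '').strip()
        if value.isdigit():
            bold = int(value) >= threshold
        else:
            bold = value in ('bold', 'bolder')
        # no early return: a later font-weight declaration overrides this one
    return bold


print(is_bold_style('font-weight:700;'))
print(is_bold_style('font-weight:normal;'))
```

Unlike a plain substring test for "font-weight", this does not report style="font-weight:normal;" as bold.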
Reproducible Example: This is my sample HTML file with 15 text elements, 4 of which are bold:
<html><head><title>text (1)</title></head><body><div>text (2)</div><div>text (3)</div><div><span>text (4)</span></div><div>text (5)</div><div><span>text (6)</span><span>text (7)</span></div><div><span style="font-weight:bold;">bold text (8)</span></div><div><span>text (9)</span></div><div><span style="font-weight:700;">bold text (10)</span></div><div><span>text (11)</span></div><div><span><b>bold text (12)</b></span></div><div><span>text (13)</span><span><a href="www.google.de"><b>bold text (14)</b></a></span></div><div><span>text (15)</span></div></body></html>
I figured out how to get all the text elements with Beautiful Soup...
from bs4 import BeautifulSoup

# html_file holds the path to the .html file
with open(html_file, 'r') as f:
    # create soup object of .html file
    soup = BeautifulSoup(f, 'html.parser')

# string=True matches every text node (it replaces the older text= argument)
soup.find_all(string=True)
# output: ['text (1)', 'text (2)', 'text (3)', 'text (4)', 'text (5)', 'text (6)', 'text (7)', 'bold text (8)', 'text (9)', 'bold text (10)', 'text (11)', 'bold text (12)', 'text (13)', 'bold text (14)', 'text (15)']
...but I cannot figure out how to get the tag attributes (font-weight), nor how to check whether the enclosing tag is <b>...</b>. Can you please give me a hint?
CodePudding user response:
You could check the text's parent, either for the name b or for an existing style attribute that mentions font-weight, to get a step closer:
for e in soup.find_all(string=True):
    data.append({
        'text': e,
        # True when the text sits directly inside a <b> tag
        'isBoldTag': e.parent.name == 'b',
        # True when the parent's inline style mentions font-weight at all
        'isBoldStyle': 'font-weight' in (e.parent.get('style') or '')
    })
Example
from bs4 import BeautifulSoup
html='''<html><head><title>text (1)</title></head><body><div>text (2)</div><div>text (3)</div><div><span>text (4)</span></div><div>text (5)</div><div><span>text (6)</span><span>text (7)</span></div><div><span style="font-weight:bold;">bold text (8)</span></div><div><span>text (9)</span></div><div><span style="font-weight:700;">bold text (10)</span></div><div><span>text (11)</span></div><div><span><b>bold text (12)</b></span></div><div><span>text (13)</span><span><a href="www.google.de"><b>bold text (14)</b></a></span></div><div><span>text (15)</span></div></body></html>'''
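# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')

data = []
for e in soup.find_all(string=True):
    data.append({
        'text': e,
        # True when the text sits directly inside a <b> tag
        'isBoldTag': e.parent.name == 'b',
        # True when the parent's inline style mentions font-weight at all
        'isBoldStyle': 'font-weight' in (e.parent.get('style') or '')
    })

data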
Output
[{'text': 'text (1)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (2)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (3)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (4)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (5)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (6)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (7)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'bold text (8)', 'isBoldTag': False, 'isBoldStyle': True}, {'text': 'text (9)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'bold text (10)', 'isBoldTag': False, 'isBoldStyle': True}, {'text': 'text (11)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'bold text (12)', 'isBoldTag': True, 'isBoldStyle': False}, {'text': 'text (13)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'bold text (14)', 'isBoldTag': True, 'isBoldStyle': False}, {'text': 'text (15)', 'isBoldTag': False, 'isBoldStyle': False}]
Or as a DataFrame, via pd.DataFrame(data):
 | text | isBoldTag | isBoldStyle |
---|---|---|---|
0 | text (1) | False | False |
1 | text (2) | False | False |
2 | text (3) | False | False |
3 | text (4) | False | False |
4 | text (5) | False | False |
5 | text (6) | False | False |
6 | text (7) | False | False |
7 | bold text (8) | False | True |
8 | text (9) | False | False |
9 | bold text (10) | False | True |
10 | text (11) | False | False |
11 | bold text (12) | True | False |
12 | text (13) | False | False |
13 | bold text (14) | True | False |
14 | text (15) | False | False |
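The check above only looks at the direct parent and flags any style that mentions font-weight, so <b><i>...</i></b> or style="font-weight:normal;" would be misreported. One possible refinement, as a rough sketch, walks up all ancestors and folds both checks into the single yes/no column asked for in the question. The names BOLD_TAGS, declares_bold and is_bold are hypothetical helpers, treating <strong> as bold is an added assumption, and the HTML snippet is made up for illustration. The sketch ignores stylesheets, classes, and an inner font-weight:normal that resets an outer bold.

```python
from bs4 import BeautifulSoup

# hypothetical snippet covering nested tags and a non-bold font-weight
html = ('<div><b><i>nested bold</i></b>'
        '<span style="font-weight:normal;">plain</span>'
        '<strong>strong</strong>'
        '<span style="font-weight:700;"><a>styled</a></span></div>')

BOLD_TAGS = {'b', 'strong'}  # assumption: <strong> counts as bold too


def declares_bold(tag):
    """True if the tag's inline style sets font-weight to bold or 700."""
    style = (tag.get('style') or '').replace(' ', '').lower()
    return 'font-weight:bold' in style or 'font-weight:700' in style


def is_bold(text_node):
    """True if any ancestor is a bold tag or carries a bold inline style."""
    return any(
        ancestor.name in BOLD_TAGS or declares_bold(ancestor)
        for ancestor in text_node.parents
    )


soup = BeautifulSoup(html, 'html.parser')
rows = [
    {'Text': str(s), 'Bold': 'yes' if is_bold(s) else 'no'}
    for s in soup.find_all(string=True)
]
print(rows)
```

For this snippet only 'plain' comes out as 'no'. Passing rows to pd.DataFrame(rows) would give the Text/Bold layout from the question.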