I have looked at various approaches, from BeautifulSoup to XML parsers, and I think there must be a simpler way to iterate over an HTML file and parse its contents into a dataframe. The file contains a lot of information under specific section headers:
<h2 >CHAPTER 1</h2>
<p style="line-height: 100%; margin-bottom: 0.08in"><b>1</b>text
<p style="line-height: 100%; margin-bottom: 0.08in"><b>Header</b></p>
<p align="left" style="line-height: 100%; margin-bottom: 0.08in"><b>2</b>text
</p>
<p align="left" style="line-height: 120%; margin-left: 0.3in; text-indent: -0.3in; margin-bottom: 0.08in">
<b>3 </b>text
</p>
<p style="line-height: 100%; margin-bottom: 0.08in">text <b>4 </b>text
<b>5 </b>text<b>6 </b>text
</p>
The HTML is a bit of a mess because it was converted from a docx file, but all I need to do is parse each piece of text following a bold number (<b>#</b>) into its own row:
Chapter | Number | Text |
---|---|---|
1 | 1 | text |
1 | 2 | text |
1 | 3 | text |
Perhaps I need to treat the <b>#</b> tag as a delimiter?
I tried BeautifulSoup's find_all, but it only returns the strings between tags, and I need a way to return the text that follows a tag.
CodePudding user response:
Based on your example, you could select all <b> elements and check whether their text isnumeric(). Then use find_previous() and next_sibling to pick up the necessary information to the left and right of each one:
for e in soup.select('b'):
    if e.get_text(strip=True).isnumeric():
        data.append({
            'chapter': e.find_previous('h2').get_text(strip=True).split()[-1],
            'number': e.get_text(strip=True),
            'text': e.next_sibling.strip() if e.next_sibling else None
        })
Example
from bs4 import BeautifulSoup
import pandas as pd
html = '''
<h2 >CHAPTER 1</h2>
<p style="line-height: 100%; margin-bottom: 0.08in"><b>1</b>text
<p style="line-height: 100%; margin-bottom: 0.08in"><b>Header</b></p>
<p align="left" style="line-height: 100%; margin-bottom: 0.08in"><b>2</b>text
</p>
<p align="left" style="line-height: 120%; margin-left: 0.3in; text-indent: -0.3in; margin-bottom: 0.08in">
<b>3 </b>text
</p>
<p style="line-height: 100%; margin-bottom: 0.08in">text <b>4 </b>text
<b>5 </b>text<b>6 </b>text
</p>
'''
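# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')
data = []
for e in soup.select('b'):
    # Only bold tags whose text is a number mark the start of a row
    if e.get_text(strip=True).isnumeric():
        data.append({
            'chapter': e.find_previous('h2').get_text(strip=True).split()[-1],
            'number': e.get_text(strip=True),
            'text': e.next_sibling.strip() if e.next_sibling else None
        })
pd.DataFrame(data)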
Output
 | chapter | number | text |
---|---|---|---|
0 | 1 | 1 | text |
1 | 1 | 2 | text |
2 | 1 | 3 | text |
3 | 1 | 4 | text |
4 | 1 | 5 | text |
5 | 1 | 6 | text |
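Note that next_sibling only returns the single node directly after the <b> tag. If the real file has inline markup inside the text (for example an <i> tag), that node may be a tag rather than a string, or the text may be split over several nodes. A minimal sketch of one way to handle that, assuming the text for a number runs until the next numeric <b> or the next paragraph, is to walk next_siblings instead. The helper names is_number_marker and rows_from_html and the sample markup are hypothetical, made up for illustration:

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Assumption: these tags end the text that belongs to a number. Some parsers
# may nest later paragraphs inside an unclosed <p>, so stop there as well.
BLOCK_TAGS = {"p", "h2"}


def is_number_marker(node):
    """True for a <b> tag whose stripped text is purely numeric."""
    return (
        isinstance(node, Tag)
        and node.name == "b"
        and node.get_text(strip=True).isnumeric()
    )


def rows_from_html(html):
    """Return one dict per numeric <b> marker (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for b in soup.find_all("b"):
        if not is_number_marker(b):
            continue
        parts = []
        # Walk to the right until the next marker or block-level tag.
        for sib in b.next_siblings:
            if is_number_marker(sib):
                break
            if isinstance(sib, Tag) and sib.name in BLOCK_TAGS:
                break
            # Strings are kept as-is; inline tags contribute their inner text.
            parts.append(str(sib) if isinstance(sib, NavigableString) else sib.get_text())
        heading = b.find_previous("h2")
        rows.append({
            "chapter": heading.get_text(strip=True).split()[-1] if heading else None,
            "number": b.get_text(strip=True),
            # Collapse runs of whitespace and newlines into single spaces.
            "text": " ".join("".join(parts).split()),
        })
    return rows


# Made-up sample with inline markup and two chapters.
sample = """
<h2>CHAPTER 1</h2>
<p><b>1</b>alpha <i>beta</i> gamma</p>
<p><b>Header</b></p>
<p>intro <b>2 </b>delta
<b>3 </b>epsilon</p>
<h2>CHAPTER 2</h2>
<p><b>1</b>zeta</p>
"""
rows = rows_from_html(sample)
```

Passing the result to pd.DataFrame(rows) should give the same chapter, number and text columns as above. Whether a non-numeric <b> such as Header should also end the text is a judgment call; this sketch keeps it as part of the text when it sits in the same paragraph.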