How to iterate over an HTML file and parse specific data into a DataFrame?

I have looked at various approaches, from BeautifulSoup to XML parsers, and I think there must be a simpler way to iterate over an HTML file and parse its contents into a DataFrame. The file contains a lot of information under specific section headers:

<h2 >CHAPTER 1</h2>
    <p  style="line-height: 100%; margin-bottom: 0.08in"><b>1</b>text
    <p  style="line-height: 100%; margin-bottom: 0.08in"><b>Header</b></p>
    <p align="left" style="line-height: 100%; margin-bottom: 0.08in"><b>2</b>text
    </p>
    <p align="left" style="line-height: 120%; margin-left: 0.3in; text-indent: -0.3in; margin-bottom: 0.08in">
    <b>3 </b>text
    </p>
    <p  style="line-height: 100%; margin-bottom: 0.08in">text <b>4 </b>text
    <b>5 </b>text<b>6 </b>text
    </p>

The HTML is a bit of a mess because it was converted from a docx file, but all I need to do is parse each piece of text that follows a bold number into its own row:

Chapter	Number	Text
1	1	text
1	2	text
1	3	text

Perhaps I need to add a tag around each number to act as a delimiter?

I tried using BeautifulSoup find_all but this only returns strings between tags, and I need a way to return the text following a set of tags.

CodePudding user response:

Based on your example, you could select all <b> elements and check whether their text isnumeric(). Then use find_previous() to get the chapter from the preceding <h2>, and next_sibling to get the text that follows the number:

for e in soup.select('b'):
    if e.get_text(strip=True).isnumeric():
        data.append({
            'chapter': e.find_previous('h2').get_text(strip=True).split()[-1],
            'number': e.get_text(strip=True),
            'text': e.next_sibling.strip() if e.next_sibling else None
        })

Example

import pandas as pd
from bs4 import BeautifulSoup
html = '''
<h2 >CHAPTER 1</h2>
    <p  style="line-height: 100%; margin-bottom: 0.08in"><b>1</b>text
    <p  style="line-height: 100%; margin-bottom: 0.08in"><b>Header</b></p>
    <p align="left" style="line-height: 100%; margin-bottom: 0.08in"><b>2</b>text
    </p>
    <p align="left" style="line-height: 120%; margin-left: 0.3in; text-indent: -0.3in; margin-bottom: 0.08in">
    <b>3 </b>text
    </p>
    <p  style="line-height: 100%; margin-bottom: 0.08in">text <b>4 </b>text
    <b>5 </b>text<b>6 </b>text
    </p>
'''
soup = BeautifulSoup(html, 'html.parser')

data = []

for e in soup.select('b'):
    if e.get_text(strip=True).isnumeric():
        data.append({
            'chapter': e.find_previous('h2').get_text(strip=True).split()[-1],
            'number': e.get_text(strip=True),
            'text': e.next_sibling.strip() if e.next_sibling else None
        })

pd.DataFrame(data)

Output

	chapter	number	text
0	1	1	text
1	1	2	text
2	1	3	text
3	1	4	text
4	1	5	text
5	1	6	text
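
The loop above relies on next_sibling being a plain string directly after the number. If the converted HTML sometimes wraps the following text in another tag (for example <i> or <span>), a more defensive variant may help. The following is only a minimal sketch under that assumption: extract_rows is a hypothetical helper name, not part of BeautifulSoup, and the rule "collect siblings until the next <b> or <p>" is a guess at what the real file needs.

```python
from bs4 import BeautifulSoup, Tag


def extract_rows(html):
    """Return one dict per numeric <b> tag: chapter, number, following text."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for b in soup.find_all("b"):
        number = b.get_text(strip=True)
        if not number.isnumeric():
            continue  # skip bold headers such as <b>Header</b>

        # Chapter is the last word of the closest preceding <h2>, if any.
        heading = b.find_previous("h2")
        chapter = heading.get_text(strip=True).split()[-1] if heading else None

        # Assumption: the text for this number ends at the next <b> or <p>.
        parts = []
        for sib in b.next_siblings:
            if isinstance(sib, Tag) and sib.name in ("b", "p"):
                break
            parts.append(sib.get_text() if isinstance(sib, Tag) else str(sib))

        rows.append({
            "chapter": chapter,
            "number": number,
            # Collapse newlines and runs of spaces left over from the conversion.
            "text": " ".join("".join(parts).split()),
        })
    return rows
```

The returned list can be passed to pd.DataFrame(...) in the same way as data in the example. To work from a file on disk, read its contents first and pass the resulting string to the function.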