How to iterate HTML file and parse specific data to Dataframe?

Time:01-10

I have looked at various approaches, from BeautifulSoup to XML parsers, and I think there must be a simpler way to iterate over an HTML file and parse its contents into a DataFrame. The file contains a lot of text organized under section headers like this:

<h2 >CHAPTER 1</h2>
    <p  style="line-height: 100%; margin-bottom: 0.08in"><b>1</b>text
    <p  style="line-height: 100%; margin-bottom: 0.08in"><b>Header</b></p>
    <p align="left" style="line-height: 100%; margin-bottom: 0.08in"><b>2</b>text
    </p>
    <p align="left" style="line-height: 120%; margin-left: 0.3in; text-indent: -0.3in; margin-bottom: 0.08in">
    <b>3 </b>text
    </p>
    <p  style="line-height: 100%; margin-bottom: 0.08in">text <b>4 </b>text
    <b>5 </b>text<b>6 </b>text
    </p>

The HTML is a bit messy because it was converted from a .docx file, but all I need is to put each piece of text that follows a bold number (<b>#</b>) into its own row:

Chapter  Number  Text
1        1       text
1        2       text
1        3       text

Perhaps I need to use the <b>#</b> tags as delimiters?

I tried using BeautifulSoup's find_all, but it only returns the strings between tags, and I need a way to get the text that follows a given tag.

CodePudding user response:

Based on your example, you could select all <b> elements and check whether their text is numeric with str.isnumeric(). Then use find_previous() to read the chapter from the nearest preceding <h2>, and next_sibling to read the text that follows the <b>:

for e in soup.select('b'):
    if e.get_text(strip=True).isnumeric():
        data.append({
            'chapter': e.find_previous('h2').get_text(strip=True).split()[-1],
            'number': e.get_text(strip=True),
            'text': e.next_sibling.strip() if e.next_sibling else None
        })
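One caveat with e.next_sibling.strip(): if a numbered <b> is followed directly by another tag rather than by text, next_sibling is a Tag and the call fails. Below is a minimal sketch of a guard, assuming you only want the text node immediately after the tag. The helper name following_text and the small test fragment are made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString


def following_text(tag):
    """Return the stripped text node right after `tag`, or None if there is none."""
    sibling = tag.next_sibling
    # next_sibling can be a text node, another Tag, or None.
    if isinstance(sibling, NavigableString):
        return sibling.strip() or None
    return None


# Hypothetical fragment: the second <b> is followed by a tag, not by text.
soup = BeautifulSoup('<p><b>1</b>first<b>2</b><b>3</b> third </p>', 'html.parser')
texts = [following_text(b) for b in soup.select('b')]
print(texts)
```

In the loop above, 'text': following_text(e) could then replace the inline conditional.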

Example

import pandas as pd
from bs4 import BeautifulSoup
html = '''
<h2 >CHAPTER 1</h2>
    <p  style="line-height: 100%; margin-bottom: 0.08in"><b>1</b>text
    <p  style="line-height: 100%; margin-bottom: 0.08in"><b>Header</b></p>
    <p align="left" style="line-height: 100%; margin-bottom: 0.08in"><b>2</b>text
    </p>
    <p align="left" style="line-height: 120%; margin-left: 0.3in; text-indent: -0.3in; margin-bottom: 0.08in">
    <b>3 </b>text
    </p>
    <p  style="line-height: 100%; margin-bottom: 0.08in">text <b>4 </b>text
    <b>5 </b>text<b>6 </b>text
    </p>
'''
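# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')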

data = []

for e in soup.select('b'):
    if e.get_text(strip=True).isnumeric():
        data.append({
            'chapter': e.find_previous('h2').get_text(strip=True).split()[-1],
            'number': e.get_text(strip=True),
            'text': e.next_sibling.strip() if e.next_sibling else None
        })

pd.DataFrame(data)

Output

  chapter number  text
0       1      1  text
1       1      2  text
2       1      3  text
3       1      4  text
4       1      5  text
5       1      6  text
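The example above has only one chapter, so it does not show find_previous('h2') switching chapters. The sketch below runs the same approach on a made-up two-chapter fragment (the HTML and its placeholder texts are assumptions, not taken from the question) and converts the chapter and number columns to integers, which is optional.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical two-chapter fragment in the same shape as the question's HTML.
html = '''
<h2>CHAPTER 1</h2>
<p><b>1</b>alpha</p>
<p><b>Header</b></p>
<p><b>2 </b>beta</p>
<h2>CHAPTER 2</h2>
<p><b>1</b>gamma <b>2 </b>delta</p>
'''

soup = BeautifulSoup(html, 'html.parser')

rows = []
for b in soup.select('b'):
    label = b.get_text(strip=True)
    if not label.isnumeric():
        continue  # skip bold headings such as "Header"
    sibling = b.next_sibling
    rows.append({
        # The nearest preceding <h2> gives the chapter this number belongs to.
        'chapter': b.find_previous('h2').get_text(strip=True).split()[-1],
        'number': label,
        # Only take the sibling if it is a text node (NavigableString subclasses str).
        'text': sibling.strip() if isinstance(sibling, str) else None,
    })

# Optional: store chapter and number as integers instead of strings.
df = pd.DataFrame(rows).astype({'chapter': int, 'number': int})
print(df)
```

Each row takes its chapter from the closest <h2> that appears before the bold number, so the numbering can restart in every chapter without rows being mixed up.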