How to get all elements of nested tags in BeautifulSoup?

Time:09-30

I'm pretty new to HTML and parsing, so I apologize if I use the wrong terminology. I asked a similar question earlier and found some helpful answers. I have the following HTML snippet, which consists of two tables, each with its own header row (there are more rows, but they are not relevant to this post):

<body>
    <table>
        <tr class="header">
            <th><strong>Heading 1</strong></th>
            <th><strong>Heading 2</strong></th>
            <th><strong>Heading 3</strong></th>
            <th><p><strong>Heading 4, line 1</strong></p>
            <p><strong>Heading 4, line 2</strong></p></th>
        </tr>

        <tr>
            <!--Many more rows-->
        </tr>
    </table>

    <table>
        <tr class="header">
            <th><strong>Diff Header 1</strong></th>
            <th><strong>Diff Header 2</strong></th>
            <th><strong>Diff Header 3</strong></th>
            <th><p><strong>Diff Header 4, line 1</strong></p>
            <p><strong>Diff Header 4, line 2</strong></p></th>
        </tr>
    
        <tr>
            <!--Many more rows-->
        </tr>
    </table>
</body>

I am trying to use Python 3.6 and BeautifulSoup4 to parse this and extract the text into a list. My problem is that I want a separate list for each table. My current code seems to find ALL of the <th> tags in the document instead of only the ones in the current table.

Here's what I have:

def parse_html(self):
    """ Parse the html file """
    with open(self.html_path) as f:
        soup = BeautifulSoup(f, 'html.parser')

    tables = soup.find_all('table')
    
    for table in tables:
        # Find each row in the table
        rows = table.find_all_next('tr')
        for row in rows:
            # Find each column in the row
            cols = row.find_all_next('th')
            for col in cols:
                # Print each cell
                print(col) # This is where it seems to be finding every <th>

            break          # Break just to do the first row (seems not to work?)

Question: How can I modify this code so that it only finds the <th> tags in the current row and not every row?

Thank you for any help!

CodePudding user response:

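Use .find_all instead of .find_all_next. .find_all searches only the descendants of the tag you call it on, whereas .find_all_next searches everything that comes after that tag in the document, including the rows and cells of later tables.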
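To see the difference between the two methods on a small input, here is a minimal sketch. The two-table html_doc in it is a made-up example, not the snippet from the question, and the variable names inside and after are only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical two-table document, kept tiny on purpose.
html_doc = """
<table><tr><th>A1</th><th>A2</th></tr></table>
<table><tr><th>B1</th><th>B2</th></tr></table>
"""

soup = BeautifulSoup(html_doc, "html.parser")
first_row = soup.find("tr")  # the <tr> of the first table

# find_all only looks at descendants of first_row.
inside = [th.get_text() for th in first_row.find_all("th")]

# find_all_next walks everything after first_row in document order,
# so it also reaches the <th> tags of the second table.
after = [th.get_text() for th in first_row.find_all_next("th")]

print(inside)  # ['A1', 'A2']
print(after)   # ['A1', 'A2', 'B1', 'B2']
```

The second list is what the code in the question was effectively producing for every row.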

If html_doc is your HTML snippet from the question:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")

tables = soup.find_all("table")

for table in tables:
    # Find each row in the table
    rows = table.find_all("tr")
    for row in rows:
        cols = row.find_all("th")
        for col in cols:
            print(col)

    print("-" * 80)

Prints:

<th><strong>Heading 1</strong></th>
<th><strong>Heading 2</strong></th>
<th><strong>Heading 3</strong></th>
<th><p><strong>Heading 4, line 1</strong></p>
<p><strong>Heading 4, line 2</strong></p></th>
--------------------------------------------------------------------------------
<th><strong>Diff Header 1</strong></th>
<th><strong>Diff Header 2</strong></th>
<th><strong>Diff Header 3</strong></th>
<th><p><strong>Diff Header 4, line 1</strong></p>
<p><strong>Diff Header 4, line 2</strong></p></th>
--------------------------------------------------------------------------------
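The question also asks for the header text in a separate list per table rather than printed tags. One possible way to do that is sketched below. The function name headers_per_table and the shortened html_doc are assumptions made for illustration, and the sketch relies on the header row carrying class="header" as in the question's snippet.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the HTML from the question.
html_doc = """
<table>
  <tr class="header">
    <th><strong>Heading 1</strong></th>
    <th><p><strong>Heading 4, line 1</strong></p>
    <p><strong>Heading 4, line 2</strong></p></th>
  </tr>
</table>
<table>
  <tr class="header">
    <th><strong>Diff Header 1</strong></th>
  </tr>
</table>
"""


def headers_per_table(html):
    """Return one list of header strings per <table> in the document."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for table in soup.find_all("table"):
        # Assumes the header row is marked with class="header".
        header_row = table.find("tr", class_="header")
        if header_row is None:
            result.append([])  # table without a header row
            continue
        # get_text(" ", strip=True) joins the text of multi-line cells
        # with a single space and drops surrounding whitespace.
        result.append(
            [th.get_text(" ", strip=True) for th in header_row.find_all("th")]
        )
    return result


headers = headers_per_table(html_doc)
print(headers)
```

Each inner list corresponds to one table, so headers[0] holds the first table's headings and headers[1] the second's.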