I'm pretty new to HTML and parsing, so I apologize if I use the wrong terminology. I asked a similar question earlier and got some helpful answers. I have the following HTML snippet consisting of two tables, each with a header row (there are more rows, but they are not relevant to this post):
<body>
<table>
<tr class="header">
<th><strong>Heading 1</strong></th>
<th><strong>Heading 2</strong></th>
<th><strong>Heading 3</strong></th>
<th><p><strong>Heading 4, line 1</strong></p>
<p><strong>Heading 4, line 2</strong></p></th>
</tr>
<tr>
<!--Many more rows-->
</tr>
</table>
<table>
<tr class="header">
<th><strong>Diff Header 1</strong></th>
<th><strong>Diff Header 2</strong></th>
<th><strong>Diff Header 3</strong></th>
<th><p><strong>Diff Header 4, line 1</strong></p>
<p><strong>Diff Header 4, line 2</strong></p></th>
</tr>
<tr>
<!--Many more rows-->
</tr>
</table>
</body>
I am trying to use Python 3.6 and BeautifulSoup4 to parse this and extract the text into a list. My problem is that I want a separate list for each table. My current code seems to find ALL of the <th> tags in the document instead of only the ones in the current table.
Here's what I have:
def parse_html(self):
    """ Parse the html file """
    with open(self.html_path) as f:
        soup = BeautifulSoup(f, 'html.parser')
    tables = soup.find_all('table')
    for table in tables:
        # Find each row in the table
        rows = table.find_all_next('tr')
        for row in rows:
            # Find each column in the row
            cols = row.find_all_next('th')
            for col in cols:
                # Print each cell
                print(col)  # This is where it seems to be finding every <th>
            break  # Break just to do the first row (seems not to work?)
Question: How can I modify this code so that it only finds the <th> tags in the current row and not in every row?
Thank you for any help!
Answer:
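Use .find_all instead of .find_all_next. .find_all searches only the descendants of the tag it is called on, while .find_all_next searches everything that comes after that tag in the document, including later rows and tables.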
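To see the difference between the two methods, here is a minimal sketch. The html_doc string is a shortened, made-up version of the markup in the question, and the variable names inside and after are only for illustration:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup (assumption, not the real file)
html_doc = """
<table>
  <tr class="header"><th>Heading 1</th><th>Heading 2</th></tr>
</table>
<table>
  <tr class="header"><th>Diff Header 1</th><th>Diff Header 2</th></tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")
first_table = soup.find("table")

# find_all only looks at descendants of first_table
inside = first_table.find_all("th")

# find_all_next walks everything that follows first_table in document
# order, so it also picks up the <th> tags of the second table
after = first_table.find_all_next("th")

print(len(inside))  # 2
print(len(after))   # 4
```

This is why the original loop printed every header: each call started at the current tag and kept going to the end of the document.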
If html_doc is your HTML snippet from the question:
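from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")
tables = soup.find_all("table")
for table in tables:
    # Find each row in the current table only
    rows = table.find_all("tr")
    for row in rows:
        # Find each header cell in the current row only
        cols = row.find_all("th")
        for col in cols:
            print(col)
    # Separator printed once per table
    print("-" * 80)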
Prints:
<th><strong>Heading 1</strong></th>
<th><strong>Heading 2</strong></th>
<th><strong>Heading 3</strong></th>
<th><p><strong>Heading 4, line 1</strong></p>
<p><strong>Heading 4, line 2</strong></p></th>
--------------------------------------------------------------------------------
<th><strong>Diff Header 1</strong></th>
<th><strong>Diff Header 2</strong></th>
<th><strong>Diff Header 3</strong></th>
<th><p><strong>Diff Header 4, line 1</strong></p>
<p><strong>Diff Header 4, line 2</strong></p></th>
--------------------------------------------------------------------------------
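Since the question asks for the text in a separate list per table, here is one possible way to build on this. It is only a sketch: it assumes every header row carries class="header" as in the question, and html_doc below is a shortened stand-in for the real file. The name headers_per_table is made up for this example.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup (assumption, not the real file)
html_doc = """
<table>
<tr class="header">
<th><strong>Heading 1</strong></th>
<th><p><strong>Heading 4, line 1</strong></p>
<p><strong>Heading 4, line 2</strong></p></th>
</tr>
</table>
<table>
<tr class="header">
<th><strong>Diff Header 1</strong></th>
</tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")

headers_per_table = []
for table in soup.find_all("table"):
    # Assumes the header row is marked with class="header"
    header_row = table.find("tr", class_="header")
    if header_row is None:
        continue  # skip tables without a header row
    # get_text(" ", strip=True) joins the pieces of multi-line headers
    # with a single space and drops surrounding whitespace
    headers = [th.get_text(" ", strip=True) for th in header_row.find_all("th")]
    headers_per_table.append(headers)

print(headers_per_table)
# [['Heading 1', 'Heading 4, line 1 Heading 4, line 2'], ['Diff Header 1']]
```

Each inner list holds the header text of one table, so headers_per_table[0] is the first table and headers_per_table[1] is the second.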