I would like to extract the tables that appear under an h1 'Tables' header on multiple websites. Each website has multiple h1 headers, but the 'Tables' one is consistent across all sites. Each table also comes with an h2 header, which differs from site to site, although I have managed to extract those into a list.
My current code parses all tables, even the ones that come before the h1 'Tables' header. How can I exclude those tables? The HTML is similar to the one below:
<h2>I don't care about this table</h2>
<table class="foo">
<tr>
<td>Key A</td>
</tr>
<tr>
<td>A value I don't want</td>
</tr>
</table>
<h1>Tables</h1>
<p> A description I don't care about </p>
<h2>First good table</h2>
<table class="foo">
<tr>
<td>Key B</td>
</tr>
<tr>
<td>A value I want</td>
</tr>
</table>
<h2>Second good table</h2>
<table class="foo">
<tr>
<td>Key C</td>
</tr>
<tr>
<td>A value I want</td>
</tr>
</table>
My current approach:
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(self.body, features="lxml")
headers = [tag.text for tag in soup.find_all(["h1", "h2"])]
try:
    # Find h2 table headers under h1 'Tables' header
    target_index = headers.index("Tables")
    table_headers = headers[target_index + 1 :]
except ValueError:
    print("Page doesn't contain tables")
    table_headers = []
# This includes all tables. How do we make sure we only include those under the 'Tables' header?
tables_raw = [
    [[cell.text for cell in row("th") + row("td")] for row in table("tr")]
    for table in soup("table")
]
# Create dfs and assign a name
tables_df = [pd.DataFrame(table) for table in tables_raw]
tables_and_names = list(zip(table_headers, tables_df))
I did have a look at this solution, but can't figure out how to get the df output that I currently have. Any help would be appreciated.
CodePudding user response:
You can use a CSS selector with the general sibling combinator ~:
for table in soup.select("h1:-soup-contains(Tables) ~ table"):
    for row in table.select("tr"):
        print(*[td.get_text(strip=True) for td in row.select("td")])
Prints:
Key B
A value I want
Key C
A value I want
Or without CSS selectors:
# string= is the current name for the older text= argument
h1 = soup.find("h1", string="Tables")
for table in h1.find_next_siblings("table"):
    for row in table.select("tr"):
        print(*[td.get_text(strip=True) for td in row.select("td")])
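To get back the list of (name, DataFrame) pairs from the question rather than printed rows, the sibling lookup above could be combined with the original DataFrame step. This is only a sketch under a few assumptions: the helper name tables_after_heading and the inline sample HTML are made up for illustration, each table is assumed to be directly preceded by its own h2, and html.parser is used so the example runs without lxml. Cells are collected in document order, which differs slightly from the original row("th") + row("td") when a row mixes th and td.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Illustrative sample page (not the asker's real data)
html = """
<h2>Skip me</h2>
<table><tr><td>Key A</td></tr><tr><td>Unwanted</td></tr></table>
<h1>Tables</h1>
<p>Description</p>
<h2>First good table</h2>
<table><tr><td>Key B</td></tr><tr><td>Wanted B</td></tr></table>
<h2>Second good table</h2>
<table><tr><td>Key C</td></tr><tr><td>Wanted C</td></tr></table>
"""


def tables_after_heading(soup, heading="Tables"):
    """Return (h2 title, DataFrame) pairs for tables that follow the given h1."""
    h1 = soup.find("h1", string=heading)
    if h1 is None:
        return []  # page has no such header
    result = []
    for table in h1.find_next_siblings("table"):
        # Assumption: the nearest preceding h2 sibling is this table's title
        title_tag = table.find_previous_sibling("h2")
        title = title_tag.get_text(strip=True) if title_tag else None
        rows = [
            [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
            for row in table.find_all("tr")
        ]
        result.append((title, pd.DataFrame(rows)))
    return result


soup = BeautifulSoup(html, "html.parser")
tables_and_names = tables_after_heading(soup)
```

Pairing each table with its own preceding h2 avoids relying on two separately built lists staying the same length, which is what zip(table_headers, tables_df) silently depends on.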
CodePudding user response:
This should do. Since the page has several h1 headers, look up the 'Tables' one specifically:
header = soup.find('h1', string='Tables')
for sibling in header.next_siblings:
    if sibling.name == 'table':
        do_stuff()
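Because the pages have several h1 headers, any sibling-based approach will also pick up tables that sit under a later h1. If that matters, the loop can stop at the next h1. The following is a rough sketch, not a tested solution for the real sites: the helper name tables_in_section and the sample HTML are hypothetical, and it assumes the headers and tables are siblings at the same nesting level.

```python
from bs4 import BeautifulSoup

# Illustrative page with a section before and after 'Tables'
html = """
<h1>Intro</h1>
<table><tr><td>before</td></tr></table>
<h1>Tables</h1>
<h2>Good</h2>
<table><tr><td>wanted</td></tr></table>
<h1>Appendix</h1>
<table><tr><td>after</td></tr></table>
"""


def tables_in_section(soup, heading="Tables"):
    """Collect sibling tables after the given h1, stopping at the next h1."""
    h1 = soup.find("h1", string=heading)
    if h1 is None:
        return []
    tables = []
    for sibling in h1.next_siblings:
        # Text nodes between tags have no tag name, so default to None
        name = getattr(sibling, "name", None)
        if name == "h1":
            break  # the next section starts here
        if name == "table":
            tables.append(sibling)
    return tables


soup = BeautifulSoup(html, "html.parser")
cells = [
    td.get_text(strip=True)
    for table in tables_in_section(soup)
    for td in table.find_all("td")
]
```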