Parse tables under h1 tag with BeautifulSoup and store in df

Time:09-26

I would like to extract tables from multiple websites under an h1 'Tables' header. Each website has multiple h1 headers, but the 'Tables' one is consistent across all sites. Each table also comes with an h2 header, which differs from site to site, although I have managed to extract them into a list.

My current code parses all tables, even the ones that come before the h1 'Tables' header. How can I exclude those tables? The HTML is similar to the one below:

<h2>I don't care about this table</h2>
<table class="foo">
  <tr>
    <td>Key A</td>
  </tr>
  <tr>
    <td>A value I don't want</td>
  </tr>
</table>

<h1>Tables</h1>
<p> A description I don't care about </p>
<h2>First good table</h2>
<table class="foo">
  <tr>
    <td>Key B</td>
  </tr>
  <tr>
    <td>A value I want</td>
  </tr>
</table>


<h2>Second good table</h2>
<table class="foo">
  <tr>
    <td>Key C</td>
  </tr>
  <tr>
    <td>A value I want</td>
  </tr>
</table>

My current approach:

from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup(self.body, features="lxml")
headers = [tags.text for tags in soup.find_all(["h1", "h2"])]

try:
    # Find h2 table headers under h1 'Tables' header
    target_index = headers.index("Tables")
    table_headers = headers[target_index + 1 :]

except ValueError:
    print("Page doesn't contain tables")

# This includes all tables. How do we make sure we only include those under the 'Tables' header?
tables_raw = [[[cell.text for cell in row("th") + row("td")] for row in table("tr")] for table in soup("table")]

# Create dfs and assign a name
tables_df = [pd.DataFrame(table) for table in tables_raw]
tables_and_names = list(zip(table_headers, tables_df))

I did have a look at this solution, but can't figure out how to get the df output that I currently have. Any help would be appreciated.

CodePudding user response:

You can use CSS selector with ~:

for table in soup.select("h1:-soup-contains(Tables) ~ table"):
    for row in table.select("tr"):
        print(*[td.get_text(strip=True) for td in row.select("td")])

Prints:

Key B
A value I want
Key C
A value I want

Or without CSS selectors:

h1 = soup.find("h1", string="Tables")  # `string` is the current name of the older `text` argument
for table in h1.find_next_siblings("table"):
    for row in table.select("tr"):
        print(*[td.get_text(strip=True) for td in row.select("td")])
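
Neither loop above produces the named DataFrames the question asks for. A minimal sketch of how that could be done, assuming every wanted table is a sibling of the 'Tables' h1 and that the nearest preceding h2 sibling is its name (the `html` sample and the `html.parser` choice are assumptions for illustration, not taken from the real pages):

```python
from bs4 import BeautifulSoup
import pandas as pd

# Sample markup modelled on the question's HTML.
html = """
<h2>I don't care about this table</h2>
<table class="foo"><tr><td>Key A</td></tr><tr><td>A value I don't want</td></tr></table>
<h1>Tables</h1>
<p> A description I don't care about </p>
<h2>First good table</h2>
<table class="foo"><tr><td>Key B</td></tr><tr><td>A value I want</td></tr></table>
<h2>Second good table</h2>
<table class="foo"><tr><td>Key C</td></tr><tr><td>A value I want</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
h1 = soup.find("h1", string="Tables")

tables_and_names = []
if h1 is None:
    print("Page doesn't contain tables")
else:
    # Only tables that follow the 'Tables' h1 at the same nesting level.
    for table in h1.find_next_siblings("table"):
        # Assumption: the closest h2 before the table is its name.
        name_tag = table.find_previous_sibling("h2")
        name = name_tag.get_text(strip=True) if name_tag else None
        rows = [
            [cell.get_text(strip=True) for cell in row(["th", "td"])]
            for row in table("tr")
        ]
        tables_and_names.append((name, pd.DataFrame(rows)))
```

This keeps the `(name, DataFrame)` pairing from the question's `tables_and_names`, but reads each name next to its table instead of zipping two separately built lists, so a table without an h2 cannot shift the names out of step.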

CodePudding user response:

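This should do:

# Match the 'Tables' h1 specifically; soup.find('h1') alone returns the first h1 on the page
header = soup.find('h1', string='Tables')
for sibling in header.next_siblings:
    if sibling.name == 'table':
        do_stuff(sibling)  # placeholder for your own processing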
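
Since the question says each page has several h1 headers, a sibling walk like this may also pick up tables that belong to a later h1 section. A hedged sketch of one way to stop at the next h1, using a made-up 'Appendix' section purely for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a second h1 section follows the 'Tables' section.
html = """
<h1>Tables</h1>
<table><tr><td>Key B</td></tr></table>
<h1>Appendix</h1>
<table><tr><td>Key Z</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
header = soup.find("h1", string="Tables")

section_tables = []
for sibling in header.next_siblings:
    if sibling.name == "h1":
        break  # the next h1 starts a different section
    if sibling.name == "table":
        section_tables.append(sibling)

# First cell of each collected table, to show what was kept.
first_cells = [t.find("td").get_text(strip=True) for t in section_tables]
```

Text nodes between tags have `name` set to `None`, so they fall through both checks and are skipped.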