from bs4 import BeautifulSoup
import requests

page = requests.get('https://www.capitol.tn.gov/house/members/').text
soup = BeautifulSoup(page, 'html.parser')
table = soup.find('table')
rows = table.find_all('tr')
header = rows[0].find_all('th')
header_text = []
for item in header:
    header_text.append(item.get_text(strip=True))
# check header results
print(header_text)
# get rows
for row in rows:
    row_text = []
    a = row.find_all('a')
    td = row.find_all('td')
    for item in td:
        if item:
            row_text.append(item.get_text(strip=True))
    # check row results
    if len(row_text) > 0:
        print(row_text)
I'm sorry if this is a stupid question, but I'm having trouble working out how to get the a tags' href values (the emails) to appear as the first item in each row. I've tried the insert() method, but it never gives me anything.
CodePudding user response:
This does the job:
# get rows
for row in rows:
    row_text = []
    a = row.find_all('a')
    td = row.find_all('td')
    # print(td)
    for item in td:
        email = item.find("a", {"class": "email"})
        if email is not None:
            email = email.get("href")
            row_text.append(email)
        if item:
            row_text.append(item.get_text(strip=True))
    # check row results
    if len(row_text) > 0:
        print(row_text)
The code checks whether each td tag contains an a tag that belongs to the class email. If it does, it gets the href from that tag and stores it in a variable named email, which is then appended to the row_text list.
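If the email should always be the first item in the row, as the question asks, one option is to collect the cell text first and then put the href at index 0. The following is only a minimal sketch under some assumptions: SAMPLE_HTML is made-up markup standing in for the real page (which may be structured differently), the class name email is taken from the code above, and rows_with_email_first is a hypothetical helper name. It runs on the inline sample so it can be tried without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the members table; the real page may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>District</th><th>Email</th></tr>
  <tr>
    <td>Jane Doe</td>
    <td>District 1</td>
    <td><a class="email" href="mailto:jane.doe@example.com">Email</a></td>
  </tr>
  <tr>
    <td>John Roe</td>
    <td>District 2</td>
    <td></td>
  </tr>
</table>
"""


def rows_with_email_first(html):
    """Return one list per data row, with the email href (if any) at index 0."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    result = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if not cells:
            continue  # the header row has only th cells
        row_text = [cell.get_text(strip=True) for cell in cells]
        # Assumes the email link carries class="email", as in the answer above.
        link = row.find("a", class_="email")
        if link is not None and link.get("href"):
            # list.insert changes the list in place and returns None,
            # so its result should not be assigned back to row_text.
            row_text.insert(0, link["href"])
        result.append(row_text)
    return result


for row in rows_with_email_first(SAMPLE_HTML):
    print(row)
```

One guess at why insert() seemed to give nothing: it modifies the list in place and returns None, so something like row_text = row_text.insert(0, x) or print(row_text.insert(0, x)) would show None even though the insert worked. To use this on the live page, pass the text fetched with requests.get(...) instead of SAMPLE_HTML.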