I'm trying to scrape names, emails and titles from both the Clemson and Ohio State Athletics Department websites. At first glance, the HTML is formatted the same way on both, which leads me to believe the method I use to scrape one should work for the other.
https://clemsontigers.com/staff-directory/ https://ohiostatebuckeyes.com/staff-directory/
But this doesn't seem to be the case. I'm able to get names from both websites and emails from Ohio State, but not emails from Clemson. I'm also unsure how to get the titles (e.g. 'Director of Athletics') from either site.
Thoughts on how I can go about solving this? Thanks in advance!
from bs4 import BeautifulSoup
import requests
import re
import selenium

# Read the whitespace-separated URLs from the file
with open('websites.txt', 'r') as f:
    urls = f.read().split()

for url in urls:
    print(f'CURRENTLY PARSING: {url}')
    print()
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    try:
        for information in soup.find_all('tr'):
            names = information.td.text.strip()
            positions = None  # unsure how to get the title here
            emails = information.find('a', href=re.compile('mailto:')).text.strip()
            print(names)
            print(positions)
            print(emails)
            print()
    except Exception as e:
        pass
CodePudding user response:
The main issue is that the email address is not available as the link text in both structures, so it is better to read it from the href attribute:
information.select_one('a[href^="mailto:"]').get('href').split(':')[-1]
Also check that the mailto link is actually present in the row before reading it:
emails = information.select_one('a[href^="mailto:"]').get('href').split(':')[-1] if information.select_one('a[href^="mailto:"]') else None
or, with the walrus operator (Python 3.8+):
emails = e.get('href').split(':')[-1] if (e := information.select_one('a[href^="mailto:"]')) else None
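To illustrate why the href is more reliable than the link text, here is a minimal offline sketch. The HTML fragment is made up for illustration (it is not copied from either site), and `extract_email` is a hypothetical helper name. The second row mimics a link whose visible text is not the address.

```python
from bs4 import BeautifulSoup

# Made-up rows for illustration only; not copied from either site.
SAMPLE_HTML = """
<table>
  <tr><td>Jane Doe</td><td><a href="mailto:jdoe@example.edu">jdoe@example.edu</a></td></tr>
  <tr><td>John Roe</td><td><a href="mailto:jroe@example.edu"><i class="icon"></i></a></td></tr>
  <tr><td>No Email</td><td></td></tr>
</table>
"""


def extract_email(row):
    """Return the address from the first mailto: link in row, or None."""
    link = row.select_one('a[href^="mailto:"]')
    if link is None:
        return None
    # Drop the "mailto:" scheme and any "?subject=..." query part.
    return link['href'].split(':', 1)[1].split('?', 1)[0].strip()


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Address taken from the href attribute of each row
emails = [extract_email(row) for row in soup.find_all('tr')]

# Visible text of the same links, for comparison
link_texts = [a.text.strip() for a in soup.select('a[href^="mailto:"]')]

print(emails)
print(link_texts)
```

In this fragment the href yields an address for both links, while the link text is empty for the second one, and the row without a link gives `None` instead of raising an error.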
Note: There is no one-size-fits-all solution; you always have to inspect the resources you want to scrape.
Example
from bs4 import BeautifulSoup
import requests

urls = ['https://clemsontigers.com/staff-directory/', 'https://ohiostatebuckeyes.com/staff-directory/']

for url in urls:
    print(f'CURRENTLY PARSING: {url}')
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    for information in soup.find_all('tr'):
        names = information.td.text.strip()
        emails = information.select_one('a[href^="mailto:"]').get('href').split(':')[-1] if information.select_one('a[href^="mailto:"]') else None
        print(names, emails)
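The question also asks about titles, which the example above does not cover. The following is only a sketch under a loud assumption: that each row's cells are ordered name first, title second. That ordering is not confirmed for either site and has to be checked in the actual HTML. The sample markup and the `parse_row` helper are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up markup; the column order (name, title, email) is an assumption.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Title</th><th>Email</th></tr>
  <tr><td>Jane Doe</td><td>Director of Athletics</td>
      <td><a href="mailto:jdoe@example.edu">Email</a></td></tr>
  <tr><td>John Roe</td><td>Head Coach</td><td></td></tr>
</table>
"""


def parse_row(row):
    """Return a dict with name, title and email, or None for non-data rows."""
    cells = row.find_all('td')
    if len(cells) < 2:
        # Header rows (th only) and separator rows have no usable cells.
        return None
    link = row.select_one('a[href^="mailto:"]')
    return {
        'name': cells[0].get_text(strip=True),
        'title': cells[1].get_text(strip=True),  # assumed: title is the 2nd cell
        'email': link['href'].split(':', 1)[1] if link else None,
    }


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
parsed = [parse_row(row) for row in soup.find_all('tr')]
staff = [record for record in parsed if record is not None]

for record in staff:
    print(record['name'], record['title'], record['email'])
```

Skipping rows with fewer than two `td` cells also avoids the `AttributeError` that `information.td.text` raises on a row that has no `td` at all, such as a header row built from `th` cells.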