Web scraping two separate sites

Time:08-29

I'm trying to scrape names, emails, and titles from both the Clemson and Ohio State athletics department websites. At first glance, the HTML on both sites is formatted the same way, which leads me to believe that the method I use to scrape one should work for the other.

https://clemsontigers.com/staff-directory/ https://ohiostatebuckeyes.com/staff-directory/

But this doesn't seem to be the case: I'm able to get names from both websites and emails from Ohio State, but not emails from Clemson. I'm also unsure how to get the titles (e.g. 'Director of Athletics') from either site.

Thoughts on how I can go about solving this? Thanks in advance!

from bs4 import BeautifulSoup
import requests
import re
import selenium

urls = ''

with open('websites.txt', 'r') as f:
    for line in f.read():
        urls += line

urls = list(urls.split())

for url in urls:

    print(f'CURRENTLY PARSING: {url}')
    print()

    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')

    try:
        for information in soup.find_all('tr'):
            names = information.td.text.strip()
            positions = ???
            emails = information.find('a', href=re.compile('mailto:')).text.strip()
            print(names)
            print(positions)
            print(emails)
            print()


    except Exception as e:
      pass
     

CodePudding user response:

The main issue is that the email address is not available as the link text in both structures, so it is better to read it from the href attribute:

information.select_one('a[href^="mailto:"]').get('href').split(':')[-1]

Also check that the row actually contains an email link before reading it:

emails = information.select_one('a[href^="mailto:"]').get('href').split(':')[-1] if information.select_one('a[href^="mailto:"]') else None

or, with the walrus operator available since Python 3.8:

emails = e.get('href').split(':')[-1] if (e := information.select_one('a[href^="mailto:"]')) else None
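
If some links carry extra parameters such as ?subject=..., splitting on ':' would keep them in the result. A minimal sketch of a stricter helper using only the standard library follows; the name email_from_href and the example addresses are hypothetical, and whether either site uses such parameters is not confirmed here.

```python
from urllib.parse import unquote, urlsplit


def email_from_href(href):
    """Return the address part of a mailto: link, or None for other links."""
    parts = urlsplit(href.strip())
    if parts.scheme.lower() != "mailto":
        return None
    # In a mailto: URL the address is the path; ?subject=... ends up in the query.
    address = unquote(parts.path).strip()
    return address or None


# Hypothetical inputs for illustration only
print(email_from_href("mailto:jane.doe@example.edu"))
print(email_from_href("mailto:jane.doe@example.edu?subject=Hello"))
print(email_from_href("/staff-directory/"))
```

It could be called as email_from_href(link.get('href')) once the link has been found with select_one.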

Note: There is no one-size-fits-all solution; you always have to inspect the resources you want to scrape.

Example

from bs4 import BeautifulSoup
import requests

urls = ['https://clemsontigers.com/staff-directory/','https://ohiostatebuckeyes.com/staff-directory/']

for url in urls:

    print(f'CURRENTLY PARSING: {url}')

    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')

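    for information in soup.find_all('tr'):
        # Skip rows without a <td> cell (e.g. header rows)
        if information.td is None:
            continue
        names = information.td.text.strip()
        # Read the address from the href, since the link text is not always the email
        link = information.select_one('a[href^="mailto:"]')
        emails = link.get('href').split(':')[-1] if link else None
        print(names, emails)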
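
The titles asked about in the question are not covered above. As a rough sketch, and assuming (this is not verified against either site) that each data row holds the name in its first cell and the title in its second, the row handling could look like the following. SAMPLE_HTML and parse_staff_rows are hypothetical and only stand in for the real pages, so the cell positions need to be checked in the browser's inspector first.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real pages may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Title</th><th>Email</th></tr>
  <tr>
    <td>Jane Doe</td>
    <td>Director of Athletics</td>
    <td><a href="mailto:jane.doe@example.edu">Email</a></td>
  </tr>
  <tr>
    <td>John Roe</td>
    <td>Equipment Manager</td>
    <td></td>
  </tr>
</table>
"""


def parse_staff_rows(html):
    """Return a list of (name, title, email) tuples, one per data row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = tr.find_all("td")
        if not cells:
            continue  # rows built only from <th> cells have no <td>
        name = cells[0].get_text(strip=True)
        # Assumption: the title sits in the second cell of the row.
        title = cells[1].get_text(strip=True) if len(cells) > 1 else None
        link = tr.select_one('a[href^="mailto:"]')
        email = link["href"].split(":", 1)[-1] if link else None
        rows.append((name, title, email))
    return rows


for row in parse_staff_rows(SAMPLE_HTML):
    print(row)
```

Collecting tuples instead of printing inside the loop makes it easy to write the results to a CSV file later.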