new to web scraping (using python) and encountered a problem trying to get an email from a university's athletic department site.
I've managed to get to navigate to the email I want to extract but don't know where to go from here. When I print what I have, all I get is '' and not the actual text of the email.
I'm attaching what I have so far, let me know if it needs a better explanation.
Here's a link to an image of what I'm trying to scrape. Website and the website: https://goheels.com/staff-directory
Thanks!
Here's my code:
from bs4 import BeautifulSoup
import requests
urls = ''
with open('websites.txt', 'r') as f:
for line in f.read():
urls = line
urls = list(urls.split())
print(urls)
for url in urls:
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
try:
body = soup.find(headers="col-staff_email category-0")
links = body.a
print(links)
except Exception as e:
print(f'"This url didn\'t work:" {url}')
CodePudding user response:
The emails are hidden inside a <script>
element. With a little pushing, shoving, css selecting and string splitting you can get there:
for em in soup.select('td[headers*="col-staff_email"] script'):
target = em.text.split('var firstHalf = "')[1]
fh = target.split('";')[0]
lh = target.split('var secondHalf = "')[1].split('";')[0]
print(fh '@' lh)
Output:
[email protected]
[email protected]
[email protected]
[email protected]
etc.