I'm trying to get the description and email address from each Google search result, but my code returns only titles and links. I'm using Selenium to open the pages and bs4 to scrape the content.
What am I doing wrong? Please help. Thanks!
import re
from bs4 import BeautifulSoup
from bs4.element import Tag

# driver is an already-initialized Selenium WebDriver with the results page loaded
soup = BeautifulSoup(driver.page_source, 'lxml')
result_div = soup.find_all('div', attrs={'class': 'g'})
links = []
titles = []
descriptions = []
emails = []
phones = []
for r in result_div:
    # Check whether each element is present; skip this result if parsing fails
    try:
        # link
        link = r.find('a', href=True)
        # title
        title = r.find('h3')
        if isinstance(title, Tag):
            title = title.get_text()
        # description
        description = r.find('div', attrs={'class': 'IsZvec'})
        # description = r.find('span')
        if isinstance(description, Tag):
            description = description.get_text()
        print(description)
        # email
        email = r.find(text=re.compile(r'[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*'))
    except Exception:
        continue
CodePudding user response:
The main issue here is that the class names are dynamic, so you have to change your strategy and select your elements by tag or id instead.
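To see why selecting by structure is more robust than selecting by class, here is a minimal sketch on a static fragment. The markup, class names, and URLs below are hypothetical stand-ins for a results page, not Google's actual HTML; only the selector itself comes from this answer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two results whose class names differ, as they would
# when the page generates them dynamically.
HTML = """
<div id="results">
  <div class="aB1"><div class="q9"><div><a href="https://example.com/a"><h3>First result</h3></a></div></div><div class="zz">First description</div></div>
  <div class="Xy7"><div class="k2"><div><a href="https://example.com/b"><h3>Second result</h3></a></div></div><div class="pp">Second description</div></div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Class-based lookup finds nothing, because no element carries class "g".
by_class = soup.find_all("div", attrs={"class": "g"})

# Structure-based lookup: any div with a direct child div that directly
# contains a link wrapping an h3 (at any depth inside the link).
by_structure = soup.select("div:has(> div > a h3)")

titles = [e.h3.get_text() for e in by_structure]
urls = [e.a.get("href") for e in by_structure]

print(len(by_class), titles, urls)
```

The structural selector keeps working when the class names change, as long as the nesting of `div`, `a`, and `h3` stays the same.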
...
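data = []
# Select each result container by its structure rather than by class name
for e in soup.select('div:has(> div > a h3)'):
    data.append({
        'title': e.h3.text,
        'url': e.a.get('href'),
        'desc': e.next_sibling.text,
        # first email-like string in the result block, or None (":=" requires Python 3.8+)
        'email': m.group(0) if (m := re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', e.parent.text)) else None
    })
data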
Output
[{'title': 'Email design at Stack Overflow',
'url': 'https://stackoverflow.design/email/guidelines/getting-started/',
'desc': 'An email design system that helps us work together to create consistently-designed, properly-rendered email for all Stack Overflow users.',
'email': None},
{'title': 'Is email from [email protected] legit? - Meta ...',
'url': 'https://meta.stackoverflow.com/questions/338332/is-email-from-do-not-replystackoverflow-email-legit',
'desc': '23.11.2016 · 1\xa0AntwortYes it is legit. We use it to protect stackoverflow.com user cookies from third parties. The links in the email are all rewritten to a\xa0...',
'email': '[email protected]'},
{'title': "Newest 'email' Questions - Stack Overflow",
'url': 'https://stackoverflow.com/questions/tagged/email',
'desc': 'Use this tag for questions involving code to send or receive email messages. Posting to ask why the emails you send are marked as spam is off-topic for Stack\xa0...',
'email': None},
{'title': 'Contact information - contact us today - Stack Overflow',
'url': 'https://stackoverflow.co/company/contact',
'desc': "A private, secure home for your team's questions and answers. Perfect for teams of 10-500 members. No more digging through stale wikis and lost emails—give your\xa0...",
'email': None},
{'title': 'How can I get the email of a stackoverflow user? - Meta Stack ...',
'url': 'https://meta.stackexchange.com/questions/64970/how-can-i-get-the-email-of-a-stackoverflow-user',
'desc': '18.09.2010 · 1\xa0AntwortYou can\'t. Read your own profile. The e-mail box says "never displayed". The closest we have to private messaging is commenting as a reply\xa0...',
'email': None},...]
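To check the email pattern in isolation, without Selenium or a live page, here is a minimal sketch. It assumes the intended pattern is `[\w.+-]+@[\w-]+\.[\w.-]+`; the helper name `first_email`, the sample strings, and the trailing-punctuation cleanup are illustrative additions, not part of the answer's code.

```python
import re

# Assumed pattern: local part, "@", first domain label, ".", rest of the domain.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def first_email(text):
    """Return the first email-like substring in text, or None if there is none."""
    m = EMAIL_RE.search(text)
    if m is None:
        return None
    # The last character class allows "." and "-", so a sentence-ending period
    # can be swallowed into the match; strip it (an added assumption).
    return m.group(0).rstrip(".-")


samples = [
    "Contact sales@example.com today",
    "Write to jane.doe+news@mail.example.org.",
    "No address here",
]
found = [first_email(s) for s in samples]
print(found)
```

Most snippets in the output above yield `None` simply because search result descriptions rarely contain an address; to collect more emails you would likely have to open each result URL and search that page's text.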