Google scrape returns no description or email

Time:04-01

I'm trying to get the description and email from each Google search result, but my code returns only titles and links. I'm using Selenium to open the pages and bs4 to scrape the content.

What am I doing wrong? Please help. Thanks!

import re

from bs4 import BeautifulSoup, Tag

# driver is the Selenium WebDriver that has already loaded the results page
soup = BeautifulSoup(driver.page_source, 'lxml')
result_div = soup.find_all('div', attrs={'class': 'g'})

links = []
titles = []
descriptions = []
emails = []
phones = []

for r in result_div:
    # Checks if each element is present, else, raise exception
    try:
        # link
        link = r.find('a', href=True)

        # title
        title = r.find('h3')
        if isinstance(title, Tag):
            title = title.get_text()

        # desc
        description = r.find('div', attrs={'class': 'IsZvec'})
        #description = r.find('span')
        if isinstance(description, Tag):
            description = description.get_text()
            print(description)

        # email
        email = r.find(text=re.compile(r'[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*'))

Answer:

The main issue is that the class names are dynamic, so a selector tied to a class such as 'IsZvec' stops matching. Change your strategy and select the elements by tag structure or id instead.

...
data = []

for e in soup.select('div:has(> div > a h3)'):
    data.append({
        'title':e.h3.text,
        'url':e.a.get('href'),
        'desc':e.next_sibling.text,
        'email':m.group(0) if (m:= re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', e.parent.text)) else None
    })
    
data

Output

[{'title': 'Email design at Stack Overflow',
  'url': 'https://stackoverflow.design/email/guidelines/getting-started/',
  'desc': 'An email design system that helps us work together to create consistently-designed, properly-rendered email for all Stack Overflow users.',
  'email': None},
 {'title': 'Is email from do-not-reply@stackoverflow.email legit? - Meta ...',
  'url': 'https://meta.stackoverflow.com/questions/338332/is-email-from-do-not-replystackoverflow-email-legit',
  'desc': '23.11.2016 · 1\xa0AntwortYes it is legit. We use it to protect stackoverflow.com user cookies from third parties. The links in the email are all rewritten to a\xa0...',
  'email': 'do-not-reply@stackoverflow.email'},
 {'title': "Newest 'email' Questions - Stack Overflow",
  'url': 'https://stackoverflow.com/questions/tagged/email',
  'desc': 'Use this tag for questions involving code to send or receive email messages. Posting to ask why the emails you send are marked as spam is off-topic for Stack\xa0...',
  'email': None},
 {'title': 'Contact information - contact us today - Stack Overflow',
  'url': 'https://stackoverflow.co/company/contact',
  'desc': "A private, secure home for your team's questions and answers. Perfect for teams of 10-500 members. No more digging through stale wikis and lost emails—give your\xa0...",
  'email': None},
 {'title': 'How can I get the email of a stackoverflow user? - Meta Stack ...',
  'url': 'https://meta.stackexchange.com/questions/64970/how-can-i-get-the-email-of-a-stackoverflow-user',
  'desc': '18.09.2010 · 1\xa0AntwortYou can\'t. Read your own profile. The e-mail box says "never displayed". The closest we have to private messaging is commenting as a reply\xa0...',
  'email': None},...]
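
To try the structural selector without hitting Google, here is a minimal, self-contained sketch. The HTML in SAMPLE_HTML is hypothetical (it only mimics the nesting the selector relies on and is not Google's real markup), parse_results is an illustrative helper name, and 'html.parser' is used instead of 'lxml' just to avoid an extra dependency. It also uses find_next_sibling('div') rather than next_sibling, on the assumption that a whitespace text node may sit between the two blocks.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup, NOT Google's real HTML; class names are random on purpose.
SAMPLE_HTML = (
    '<div id="search">'
    '<div class="a1">'
    '<div class="b7"><div class="c3">'
    '<a href="https://example.com/contact"><h3>Contact us</h3></a>'
    '</div></div>'
    '<div class="d9">Write to info@example.com for details</div>'
    '</div>'
    '<div class="q4">'
    '<div class="r2"><div class="s8">'
    '<a href="https://example.com/about"><h3>About</h3></a>'
    '</div></div>'
    '<div class="t5">Who we are and what we do</div>'
    '</div>'
    '</div>'
)

# Same email pattern as in the answer, compiled once.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')


def parse_results(html):
    """Collect title, url, description and email using structure only, no class names."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    # A div whose direct child div directly contains a link wrapping an h3.
    for e in soup.select('div:has(> div > a h3)'):
        desc = e.find_next_sibling('div')  # skips any whitespace text nodes
        m = EMAIL_RE.search(e.parent.get_text(' '))
        rows.append({
            'title': e.h3.get_text(),
            'url': e.a.get('href'),
            'desc': desc.get_text(strip=True) if desc else None,
            'email': m.group(0) if m else None,
        })
    return rows


if __name__ == '__main__':
    for row in parse_results(SAMPLE_HTML):
        print(row)
```

Because nothing in parse_results depends on a class name, renaming every class in SAMPLE_HTML leaves the result unchanged. A change in the nesting itself would still break it, so the selector may need adjusting against the live page.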