How would I be able to find the class text "New York City, New York, USA" in Python with BeautifulSoup?
I was trying to replicate a video for practice, but it doesn't work anymore.
I tried finding something in the official documentation but didn't get it to work. Or is my get_html_content function not working properly and Google just blocks me, thus returning an empty list / None?
Here's my current code:
from django.shortcuts import render
import requests

def get_html_content(city):
    USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
    LANGUAGE = "en-US,en;q=0.5"
    session = requests.Session()
    session.headers['User-Agent'] = USER_AGENT
    session.headers['Accept-Language'] = LANGUAGE
    session.headers['Content-Language'] = LANGUAGE
    city.replace(" ", "+")
    html_content = session.get(f"https://www.google.com/search?q=weather in {city}").text
    return html_content
def home(request):
    result = None
    if 'city' in request.GET:
        city = request.GET.get('city')
        html_content = get_html_content(city)
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(html_content, 'html.parser')
        soup.find_all('div', attrs={'class': 'wob_loc q8U8x'})
        # OR
        soup.find_all('div', attrs={'id': 'wob_loc'})

Both return empty lists (the .find method returns None).
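One way to narrow down whether the request or the selector is the problem (a small sketch, reusing the get_html_content function above) is to check whether the wob_loc id shows up in the raw HTML at all:

html = get_html_content("New York City, New York, USA")
# If Google served a consent/redirect page instead of the results page,
# the weather widget will not appear in the static HTML at all.
print(len(html))                     # a consent page tends to be much shorter
print('wob_loc' in html)             # False would mean the widget is not in the fetched HTML
print('consent.google.com' in html)  # True would suggest a redirect to the consent page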
CodePudding user response:
The layout of the Google page has probably changed in the meantime, so to get data about the weather you have to change your code. For example:
import requests
from bs4 import BeautifulSoup

params = {'q': 'weather in New York City, New York, USA', 'hl': 'en'}
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0'}
cookies = {'CONSENT': 'YES+cb.20220419-08-p0.cs+FX+111'}
url = 'https://www.google.com/search'

soup = BeautifulSoup(requests.get(url, params=params, headers=headers, cookies=cookies).content, 'html.parser')

for t in soup.select('#wob_dp [aria-label]'):
    how = t.find_next('img')['alt']
    temp = t.find_next('span').get_text(strip=True)
    print('{:<5} {:<20} {}'.format(t.text, how, temp))
Prints:
Mon Sunny 8
Tue Cloudy 7
Wed Partly cloudy 11
Thu Rain 7
Fri Mostly cloudy 8
Sat Partly cloudy 6
Sun Scattered showers 8
Mon Showers 8
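The question was originally about the location string; on the same soup, the wob_loc id from the question may still be present in this version of the page (this is an assumption, the element may be missing from the static HTML), so a short follow-up check could be:

loc = soup.select_one('#wob_loc')
print(loc.get_text(strip=True) if loc else 'wob_loc not found in this response')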
CodePudding user response:
The way you select the elements is not necessarily wrong; it would work as long as the soup is based on the rendered version of the web page, which you can get, for example, by using selenium.
Your approach, however, uses requests, which cannot render dynamic content the way a browser would, so instead you have to make do with the static content.
First of all, you should also send a consent cookie with your request so that you can access the desired content (this means that your assumption is already going in the right direction):

cookies={'CONSENT': 'YES+'}

and then you should use the HTML structure to select the elements.
Example (requests)
import requests
from bs4 import BeautifulSoup

def get_html_content(city):
    USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
    LANGUAGE = "en-US,en;q=0.5"
    session = requests.Session()
    session.headers['User-Agent'] = USER_AGENT
    session.headers['Accept-Language'] = LANGUAGE
    session.headers['Content-Language'] = LANGUAGE
    city.replace(" ", "+")
    html_content = session.get(f"https://www.google.com/search?q=weather in {city}", cookies={'CONSENT': 'YES+'}).text
    return html_content

html_content = get_html_content('New York City, New York, USA')
soup = BeautifulSoup(html_content, 'html.parser')
soup.select_one('div>span>span')
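As a usage sketch on top of this (assuming the div>span>span selector actually matches the temperature element, which depends on the current page layout), the result can be unwrapped defensively, since select_one returns None when nothing matches:

temp = soup.select_one('div>span>span')
if temp is not None:
    print(temp.get_text(strip=True))  # e.g. the temperature value
else:
    print('Selector did not match - the layout may have changed or a consent page was returned')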
Example (selenium)
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
url = 'https://www.google.com/search?q=weather in New York City, New York, USA'
driver.get(url)

soup = BeautifulSoup(driver.page_source, 'html.parser')
soup.find('div', attrs={'id': 'wob_loc'})
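If this is meant to run inside the Django view rather than interactively, one common variant (a sketch continuing the imports and url from above, not part of the original answer) is to run Chrome headless and to close the driver afterwards:

from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless=new')  # newer Chrome; older versions use plain --headless
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
try:
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    loc = soup.find('div', attrs={'id': 'wob_loc'})
    print(loc.get_text(strip=True) if loc else 'wob_loc not found')
finally:
    driver.quit()  # always release the browser process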