How would I be able to find the class text "New York City, New York, USA" in Python with BeautifulSoup?
I was trying to replicate a video for practice, but it doesn't work anymore.
I tried finding something in the official documentation but didn't get it to work. Or is my get_html_content function not working properly and Google just blocks me, thus returning an empty list / None?
Here's my current code:
from django.shortcuts import render
import requests

def get_html_content(city):
    USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
    LANGUAGE = "en-US,en;q=0.5"
    session = requests.Session()
    session.headers['User-Agent'] = USER_AGENT
    session.headers['Accept-Language'] = LANGUAGE
    session.headers['Content-Language'] = LANGUAGE
    city.replace(" ", "+")
    html_content = session.get(f"https://www.google.com/search?q=weather in {city}").text
    return html_content
def home(request):
    result = None
    if 'city' in request.GET:
        city = request.GET.get('city')
        html_content = get_html_content(city)
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(html_content, 'html.parser')
        soup.find_all('div', attrs={'class': 'wob_loc q8U8x'})
        # OR
        soup.find_all('div', attrs={'id': 'wob_loc'})

Both return empty lists (the .find method returns None).
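One way to narrow down whether the request or the selector is the problem (a small sketch, reusing the get_html_content function above) is to check whether the wob_loc id shows up in the raw HTML at all:

html = get_html_content("New York City, New York, USA")
# If Google served a consent/redirect page instead of the results page,
# the weather widget will not appear in the static HTML at all.
print(len(html))                     # a consent page tends to be much shorter
print('wob_loc' in html)             # False would mean the widget is not in the fetched HTML
print('consent.google.com' in html)  # True would suggest a redirect to the consent page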
CodePudding user response:
The layout of the Google page has probably changed in the meantime, so to get data about the weather you have to change your code. For example:
import requests
from bs4 import BeautifulSoup

params = {'q': 'weather in New York City, New York, USA', 'hl': 'en'}
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0'}
cookies = {'CONSENT': 'YES+cb.20220419-08-p0.cs+FX+111'}
url = 'https://www.google.com/search'

soup = BeautifulSoup(requests.get(url, params=params, headers=headers, cookies=cookies).content, 'html.parser')

for t in soup.select('#wob_dp [aria-label]'):
    how = t.find_next('img')['alt']
    temp = t.find_next('span').get_text(strip=True)
    print('{:<5} {:<20} {}'.format(t.text, how, temp))
Prints:
Mon Sunny 8
Tue Cloudy 7
Wed Partly cloudy 11
Thu Rain 7
Fri Mostly cloudy 8
Sat Partly cloudy 6
Sun Scattered showers 8
Mon Showers 8
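The question was originally about the location string; on the same soup, the wob_loc id from the question may still be present in this version of the page (this is an assumption, the element may be missing from the static HTML), so a short follow-up check could be:

loc = soup.select_one('#wob_loc')
print(loc.get_text(strip=True) if loc else 'wob_loc not found in this response')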
CodePudding user response:
The way you select the elements is not necessarily wrong; it would work as long as the soup is based on the rendered version of the web page, which you can get, for example, by using selenium.
Your approach, however, uses requests, which cannot render dynamic content the way a browser would, so instead you have to make do with the static content.
First of all, you should also send a consent cookie with your request so that you can access the desired content (this means that your assumption is already going in the right direction):

cookies={'CONSENT': 'YES+'}

and then you should use the HTML structure to select the elements.
Example (requests)
import requests
from bs4 import BeautifulSoup

def get_html_content(city):
    USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
    LANGUAGE = "en-US,en;q=0.5"
    session = requests.Session()
    session.headers['User-Agent'] = USER_AGENT
    session.headers['Accept-Language'] = LANGUAGE
    session.headers['Content-Language'] = LANGUAGE
    city.replace(" ", "+")
    html_content = session.get(f"https://www.google.com/search?q=weather in {city}", cookies={'CONSENT': 'YES+'}).text
    return html_content

html_content = get_html_content('New York City, New York, USA')
soup = BeautifulSoup(html_content, 'html.parser')
soup.select_one('div>span>span')
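As a usage sketch on top of this (assuming the div>span>span selector actually matches the temperature element, which depends on the current page layout), the result can be unwrapped defensively, since select_one returns None when nothing matches:

temp = soup.select_one('div>span>span')
if temp is not None:
    print(temp.get_text(strip=True))  # e.g. the temperature value
else:
    print('Selector did not match - the layout may have changed or a consent page was returned')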
Example (selenium)
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
url = 'https://www.google.com/search?q=weather in New York City, New York, USA'
driver.get(url)

soup = BeautifulSoup(driver.page_source, 'html.parser')
soup.find('div', attrs={'id': 'wob_loc'})
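If this is meant to run inside the Django view rather than interactively, one common variant (a sketch continuing the imports and url from above, not part of the original answer) is to run Chrome headless and to close the driver afterwards:

from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless=new')  # newer Chrome; older versions use plain --headless
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
try:
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    loc = soup.find('div', attrs={'id': 'wob_loc'})
    print(loc.get_text(strip=True) if loc else 'wob_loc not found')
finally:
    driver.quit()  # always release the browser process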