Scraping with Beautiful Soup does not update values properly

Time:09-27

I am trying to scrape a weather website, but the data does not update properly. The code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.wunderground.com/dashboard/pws/KORPISTO1'

while True:
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    data = soup.find("div", {"class": "weather__text"})
    print(data.text)

I am looking at 'WIND & WIND GUST' in the 'CURRENT CONDITIONS' section. The first values print correctly (for example 1.0 / 2.2 mph), but after that the printed values update very slowly (sometimes 5 minutes pass between changes), even though they change every 10 to 30 seconds on the website.

And when the values do update in Python, they still differ from the values currently shown on the website.

CodePudding user response:

Try this:

import requests
from bs4 import BeautifulSoup

url = 'https://www.wunderground.com/dashboard/pws/KORPISTO1'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
session = requests.Session()
r = session.get(url, timeout=30, headers=headers)     # print(r.status_code)
soup = BeautifulSoup(r.content, 'html.parser')

# 'WIND & WIND GUST' in the 'CURRENT CONDITIONS' section
header = soup.select_one('.weather__header:-soup-contains("WIND & GUST")')
values = header.find_next('div', class_='weather__text').select('span.wu-value-to')
wind_gust = [float(i.text) for i in values]

print(wind_gust)  # e.g. [1.8, 2.2]

wind = wind_gust[0]
gust = wind_gust[1]

print(wind)  # e.g. 1.8
print(gust)  # e.g. 2.2

CodePudding user response:

Try the following example:

from bs4 import BeautifulSoup 
import requests

url= 'https://www.wunderground.com/dashboard/pws/KORPISTO1'
req = requests.get(url)

soup = BeautifulSoup(req.text,'lxml')


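# NOTE: the attribute filter inside div[...] is missing from this post; 'div[]' as written
# is not a valid CSS selector, so the filter must be restored before this line will run.
d = {soup.select('div[] div')[0].text: soup.select('div[] div')[1].text.replace('\xa0°', '')}
print(d)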

Output:

{' WIND & GUST ': '1.1 / 2.2mph'}
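The value in that dictionary is still a single string. As a minimal sketch (not part of the original answer), it could be split into numbers with the standard library alone. The helper name parse_wind_gust and the regular expression are assumptions made for illustration, and they only cover text shaped like the output shown above.

```python
import re

# Matches text such as '1.1 / 2.2mph' or '1.1 / 2.2 mph' (assumed format, based on the output above).
_WIND_GUST = re.compile(r'(\d+(?:\.\d+)?)\s*/\s*(\d+(?:\.\d+)?)\s*([A-Za-z/]+)?')


def parse_wind_gust(text):
    """Return (wind, gust, unit) parsed from text like '1.1 / 2.2mph'."""
    match = _WIND_GUST.search(text.replace('\xa0', ' '))
    if match is None:
        # Raised for placeholders or any text without two numbers around a slash.
        raise ValueError(f'unrecognised wind/gust text: {text!r}')
    wind, gust, unit = match.groups()
    return float(wind), float(gust), unit


print(parse_wind_gust('1.1 / 2.2mph'))  # (1.1, 2.2, 'mph')
```

Raising ValueError on unexpected text keeps a layout change on the page from silently producing wrong numbers.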

Selenium and bs4 to grab the dynamic value:

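import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))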

url= 'https://www.wunderground.com/dashboard/pws/KORPISTO1'
driver.get(url)
driver.maximize_window()
time.sleep(3)

page = driver.page_source
soup = BeautifulSoup(page, 'lxml')

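# NOTE: the attribute filter inside div[...] is missing from this post; 'div[]' as written
# is not a valid CSS selector, so the filter must be restored before this line will run.
d = {soup.select('div[] div')[0].text: soup.select('div[] div')[1].text.replace('\xa0°', '')}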
print(d)

To get the updated values, use Selenium together with bs4: the values are loaded dynamically by JavaScript, so the page has to be rendered in a browser before it is parsed.

Output:

{' WIND & GUST ': '1.3 / 2.2mph'}
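The Selenium example reads the page once. To follow the value over time, one option is to keep the browser open and re-read it at a fixed interval, reporting only the changes. The sketch below is an assumption-laden illustration, not part of the original answer: watch is a hypothetical helper, and read_value stands for any function that returns the current text (for instance one that re-parses driver.page_source, assuming the open page keeps updating itself).

```python
import time


def watch(read_value, interval=10.0, max_polls=None, sleep=time.sleep):
    """Yield each new value returned by read_value(), skipping repeats."""
    last = object()  # sentinel that never equals a real reading
    polls = 0
    while max_polls is None or polls < max_polls:
        value = read_value()
        polls += 1
        if value != last:
            last = value
            yield value
        # Pause between polls, but not after the final one.
        if max_polls is None or polls < max_polls:
            sleep(interval)


# Stand-in reader for demonstration; a real one would parse the live page.
readings = iter(['1.1 / 2.2mph', '1.1 / 2.2mph', '1.3 / 2.2mph'])
for value in watch(lambda: next(readings), interval=0, max_polls=3):
    print(value)
```

The pause between polls also avoids requesting or re-parsing the page in a tight loop, which the unthrottled `while True` in the question does.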