I have been trying to pull data tables from this website and cannot seem to get the table: https://www.wunderground.com/history/daily/us/nv/north-las-vegas/KVGT/date/2021-8-26
I first tried calling pd.read_html(url), where the url variable is the link above. This raises a "no tables found" error.
I then tried to access the website using urllib3 and parsing with bs4, like so:
import urllib3
from bs4 import BeautifulSoup
url = 'https://www.wunderground.com/history/daily/us/nv/north-las-vegas/KVGT/date/2021-8-26'
http = urllib3.PoolManager()
r = http.request('GET', url)
soup = BeautifulSoup(r.data, 'html.parser')
list_of_tables = soup.find_all('table')
where list_of_tables comes back as an empty list. Can anyone help me retrieve the table with all the hourly weather data? I am not sure where to go from here.
CodePudding user response:
Information on that page is loaded dynamically from an API. You can open the Network tab in your browser's dev tools to see the calls the page makes. One way of getting a table from that page would be:
import requests
import pandas as pd

# Call the same historical-observations endpoint the page itself queries
r = requests.get('https://api.weather.com/v1/location/KVGT:9:US/observations/historical.json?apiKey=e1f10a1e78da46f5b10a1e78da96f525&units=e&startDate=20210826&endDate=20210826')
# The hourly records are under the 'observations' key of the JSON payload
df = pd.DataFrame(r.json()['observations'])
df
This returns a dataframe with historical data:
key class expire_time_gmt obs_id obs_name valid_time_gmt day_ind temp wx_icon icon_extd wx_phrase pressure_tend pressure_desc dewPt heat_index rh pressure vis wc wdir wdir_cardinal gust wspd max_temp min_temp precip_total precip_hrly snow_hrly uv_desc feels_like uv_index qualifier qualifier_svrty blunt_phrase terse_phrase clds water_temp primary_wave_period primary_wave_height primary_swell_period primary_swell_height primary_swell_direction secondary_swell_period secondary_swell_height secondary_swell_direction
0 KVGT observation 1629971580 KVGT Las Vegas 1629964380 N 93 33 3300 Fair NaN None 40 89 16 27.56 10 93 190.0 S NaN 12 107.0 76.0 None 0 None Low 89 0 None None None None CLR None None None None None None None None None
1 KVGT observation 1629975180 KVGT Las Vegas 1629967980 N 93 33 3300 Fair 2.0 Falling Rapidly 41 89 16 27.55 10 93 260.0 W NaN 8 NaN NaN None 0 None Low 89 0 None None None None CLR None None None None None None None None None
2 KVGT observation 1629978780 KVGT Las Vegas 1629971580 N 90 33 3300 Fair NaN None 43 86 19 27.55 10 90 210.0 SSW NaN 6 NaN NaN None 0 None Low 86 0 None None None None CLR None None None None None None None None None
3 KVGT observation 1629982380 KVGT Las Vegas 1629975180 N 86 33 3300 Fair NaN None 41 83 20 27.56 10 86 310.0 NW NaN 3 NaN NaN None 0 None Low 83 0 None None None None CLR None None None None None None None None None
[....]
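If you only need a handful of those columns, you can tidy the result before working with it. A minimal sketch, continuing from the df built above (the column names are taken from the output, and converting valid_time_gmt from epoch seconds is my assumption about how you'd want the timestamps):
# Keep a few of the columns shown above and convert the epoch-seconds timestamp to datetimes
hourly = df[['valid_time_gmt', 'temp', 'dewPt', 'rh', 'wspd', 'pressure', 'wx_phrase']].copy()
hourly['valid_time_gmt'] = pd.to_datetime(hourly['valid_time_gmt'], unit='s')
print(hourly.head())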
For daily observations data, the URL you would need is https://api.weather.com/v1/location/KVGT:9:US/almanac/daily.json?apiKey=e1f10a1e78da46f5b10a1e78da96f525&units=e&start=0826
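I haven't shown that response here, so as a rough sketch (reusing the imports above) you might inspect the JSON first to see which key holds the daily records:
r = requests.get('https://api.weather.com/v1/location/KVGT:9:US/almanac/daily.json?apiKey=e1f10a1e78da46f5b10a1e78da96f525&units=e&start=0826')
data = r.json()
print(data.keys())  # check which key holds the list of daily records
# then build a dataframe from that key, e.g. pd.DataFrame(data['<key>'])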
You can install requests with pip install requests, and pandas with pip install pandas.
CodePudding user response:
The webpage contains two tables, and the hourly weather data is in the second one. As the webpage is dynamic, bs4 or pandas alone can't render the JavaScript, so you can use an automation tool such as Selenium. Here I use Selenium along with pandas to grab that dynamic table data:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
# chrome_options.add_argument("--headless")

webdriver_service = Service("./chromedriver")  # your chromedriver path
driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)

driver.get('https://www.wunderground.com/history/daily/us/nv/north-las-vegas/KVGT/date/2021-8-26')
driver.maximize_window()

# Wait up to 30 s for the second <table> on the page (the hourly observations) to render,
# then grab its HTML and let pandas parse it
table = WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.XPATH, '(//table)[2]'))).get_attribute("outerHTML")
df = pd.read_html(table)[0].dropna(how='all')
print(df)

driver.quit()
Output:
Time Temperature Dew Point Humidity ... Wind Gust Pressure Precip. Condition
0 12:53 AM 93 °F 40 °F 16 °% ... 0 °mph 27.56 °in 0.0 °in Fair
1 1:53 AM 93 °F 41 °F 16 °% ... 0 °mph 27.55 °in 0.0 °in Fair
2 2:53 AM 90 °F 43 °F 19 °% ... 0 °mph 27.55 °in 0.0 °in Fair
3 3:53 AM 86 °F 41 °F 20 °% ... 0 °mph 27.56 °in 0.0 °in Fair
4 4:53 AM 83 °F 42 °F 24 °% ... 0 °mph 27.56 °in 0.0 °in Fair
5 5:53 AM 81 °F 43 °F 26 °% ... 0 °mph 27.57 °in 0.0 °in Fair
6 6:53 AM 82 °F 42 °F 24 °% ... 0 °mph 27.59 °in 0.0 °in Fair
7 7:53 AM 86 °F 43 °F 22 °% ... 0 °mph 27.59 °in 0.0 °in Fair
8 8:53 AM 90 °F 41 °F 18 °% ... 0 °mph 27.59 °in 0.0 °in Fair
9 9:53 AM 94 °F 41 °F 16 °% ... 0 °mph 27.60 °in 0.0 °in Fair
10 10:53 AM 97 °F 38 °F 13 °% ... 0 °mph 27.58 °in 0.0 °in Fair
11 11:53 AM 102 °F 17 °F 5 °% ... 0 °mph 27.56 °in 0.0 °in Fair
12 12:53 PM 104 °F 13 °F 4 °% ... 21 °mph 27.55 °in 0.0 °in Fair
13 1:53 PM 106 °F 10 °F 3 °% ... 23 °mph 27.53 °in 0.0 °in Fair
14 2:53 PM 106 °F 0 °F 2 °% ... 0 °mph 27.50 °in 0.0 °in Fair
15 3:53 PM 107 °F 3 °F 2 °% ... 24 °mph 27.47 °in 0.0 °in Fair
16 4:53 PM 106 °F -3 °F 2 °% ... 25 °mph 27.46 °in 0.0 °in Fair
17 5:53 PM 105 °F -5 °F 2 °% ... 25 °mph 27.45 °in 0.0 °in Fair
18 6:53 PM 102 °F -5 °F 2 °% ... 0 °mph 27.45 °in 0.0 °in Fair
19 7:53 PM 99 °F 0 °F 2 °% ... 0 °mph 27.47 °in 0.0 °in Fair
20 8:53 PM 97 °F 6 °F 3 °% ... 0 °mph 27.49 °in 0.0 °in Fair
21 9:53 PM 95 °F 9 °F 4 °% ... 0 °mph 27.50 °in 0.0 °in Fair
22 10:53 PM 90 °F 13 °F 6 °% ... 0 °mph 27.50 °in 0.0 °in Fair
23 11:53 PM 87 °F 18 °F 8 °% ... 0 °mph 27.49 °in 0.0 °in Fair
[24 rows x 10 columns]
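The scraped cells keep their unit text (including a stray degree sign, e.g. "16 °%"), so if you need numeric values you could clean them afterwards. A minimal sketch, using only the columns visible in the printed output (the hidden "..." columns are left untouched):
# Columns shown above that carry a unit suffix (e.g. '93 °F', '16 °%', '27.56 °in')
unit_cols = ['Temperature', 'Dew Point', 'Humidity', 'Wind Gust', 'Pressure', 'Precip.']
for col in unit_cols:
    # Drop everything except digits, sign and decimal point, then convert to a number
    df[col] = pd.to_numeric(df[col].astype(str).str.replace(r'[^\d.+-]', '', regex=True), errors='coerce')
print(df[['Time'] + unit_cols].head())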