Not retrieving data tables properly with Python pandas / urllib3


I have been trying to pull the data tables from this page, but cannot get the table: https://www.wunderground.com/history/daily/us/nv/north-las-vegas/KVGT/date/2021-8-26

I first tried calling pd.read_html(url), where the url variable is the link above. This raises a "No tables found" error.

I then tried fetching the page with urllib3 and parsing it with bs4, like so:

import urllib3
from bs4 import BeautifulSoup

url = 'https://www.wunderground.com/history/daily/us/nv/north-las-vegas/KVGT/date/2021-8-26'
http = urllib3.PoolManager()
r = http.request('GET', url)
soup = BeautifulSoup(r.data, 'html.parser')  # name the parser explicitly

list_of_tables = soup.find_all('table')

Here list_of_tables is an empty list. Can anyone help me retrieve the table with all the hourly weather data? I am not sure where to go from here.

CodePudding user response:

The data on that page is loaded dynamically from an API, so the table is not present in the HTML that pd.read_html or urllib3 receives. You can see the API calls in the Network tab of your browser's dev tools. One way of getting the table is to call that API directly:

import requests
import pandas as pd

r = requests.get('https://api.weather.com/v1/location/KVGT:9:US/observations/historical.json?apiKey=e1f10a1e78da46f5b10a1e78da96f525&units=e&startDate=20210826&endDate=20210826')
df = pd.DataFrame(r.json()['observations'])

This returns a dataframe with historical data:

key class   expire_time_gmt obs_id  obs_name    valid_time_gmt  day_ind temp    wx_icon icon_extd   wx_phrase   pressure_tend   pressure_desc   dewPt   heat_index  rh  pressure    vis wc  wdir    wdir_cardinal   gust    wspd    max_temp    min_temp    precip_total    precip_hrly snow_hrly   uv_desc feels_like  uv_index    qualifier   qualifier_svrty blunt_phrase    terse_phrase    clds    water_temp  primary_wave_period primary_wave_height primary_swell_period    primary_swell_height    primary_swell_direction secondary_swell_period  secondary_swell_height  secondary_swell_direction
0   KVGT    observation 1629971580  KVGT    Las Vegas   1629964380  N   93  33  3300    Fair    NaN None    40  89  16  27.56   10  93  190.0   S   NaN 12  107.0   76.0    None    0   None    Low 89  0   None    None    None    None    CLR None    None    None    None    None    None    None    None    None
1   KVGT    observation 1629975180  KVGT    Las Vegas   1629967980  N   93  33  3300    Fair    2.0 Falling Rapidly 41  89  16  27.55   10  93  260.0   W   NaN 8   NaN NaN None    0   None    Low 89  0   None    None    None    None    CLR None    None    None    None    None    None    None    None    None
2   KVGT    observation 1629978780  KVGT    Las Vegas   1629971580  N   90  33  3300    Fair    NaN None    43  86  19  27.55   10  90  210.0   SSW NaN 6   NaN NaN None    0   None    Low 86  0   None    None    None    None    CLR None    None    None    None    None    None    None    None    None
3   KVGT    observation 1629982380  KVGT    Las Vegas   1629975180  N   86  33  3300    Fair    NaN None    41  83  20  27.56   10  86  310.0   NW  NaN 3   NaN NaN None    0   None    Low 83  0   None    None    None    None    CLR None    None    None    None    None    None    None    None    None
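
The valid_time_gmt column holds Unix timestamps in seconds. Here is a minimal sketch of turning them into local clock times using only the standard library. The helper name to_local_time is hypothetical, and the fixed UTC-7 offset (Pacific Daylight Time, which Las Vegas observes in August) is an assumption made for illustration; a date in winter would need a different offset.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the station observes Pacific Daylight Time (UTC-7) on this date.
PDT = timezone(timedelta(hours=-7), "PDT")


def to_local_time(epoch_seconds, tz=PDT):
    """Convert a Unix timestamp (seconds, UTC) to an aware local datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).astimezone(tz)


# First two valid_time_gmt values from the output above.
for ts in (1629964380, 1629967980):
    print(to_local_time(ts).strftime("%Y-%m-%d %H:%M"))
# prints:
# 2021-08-26 00:53
# 2021-08-26 01:53
```

With the dataframe above, the same conversion could be applied to the whole column, for example with df['valid_time_gmt'].map(to_local_time).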

For daily observations data, the URL you would need to request is https://api.weather.com/v1/location/KVGT:9:US/almanac/daily.json?apiKey=e1f10a1e78da46f5b10a1e78da96f525&units=e&start=0826
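
To fetch other days without editing the URL by hand, the hourly-observations URL can be assembled from its parts. This is only a sketch: the function name build_history_url and the YOUR_API_KEY placeholder are hypothetical, the query parameter names are copied from the URL used earlier, and it is an assumption that other stations and dates follow the same pattern.

```python
from datetime import date
from urllib.parse import urlencode

BASE = "https://api.weather.com/v1/location"


def build_history_url(station, day, api_key, units="e"):
    """Build the hourly-observations URL for one station and one day."""
    stamp = day.strftime("%Y%m%d")  # the API expects dates as YYYYMMDD
    query = urlencode({
        "apiKey": api_key,
        "units": units,
        "startDate": stamp,
        "endDate": stamp,
    })
    return f"{BASE}/{station}/observations/historical.json?{query}"


# "YOUR_API_KEY" is a placeholder, not a working key.
print(build_history_url("KVGT:9:US", date(2021, 8, 26), "YOUR_API_KEY"))
```

The returned string can be passed straight to requests.get, as in the snippet above.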

You can install requests with pip install requests, and pandas with pip install pandas.

CodePudding user response:

The page contains two tables, and the hourly weather data is in the second one. Because the page is rendered with JavaScript, neither bs4 nor pandas on its own can see that table. A browser automation tool such as Selenium can. Here I use Selenium together with pandas to grab the dynamically rendered table:

from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'https://www.wunderground.com/history/daily/us/nv/north-las-vegas/KVGT/date/2021-8-26'

chrome_options = Options()
# chrome_options.add_argument("--headless")

webdriver_service = Service("./chromedriver")  # your chromedriver path
driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)
driver.get(url)  # load the page so the JavaScript can render the tables

# wait up to 30 seconds for the second table (hourly observations) to appear
table = WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.XPATH, '(//table)[2]'))).get_attribute("outerHTML")
driver.quit()

# recent pandas versions expect a file-like object rather than a raw HTML string
df = pd.read_html(StringIO(table))[0].dropna(how='all')
print(df)


     Time Temperature Dew Point Humidity  ... Wind Gust   Pressure  Precip. Condition
0   12:53 AM       93 °F     40 °F    16 °%  ...    0 °mph  27.56 °in  0.0 °in      Fair      
1    1:53 AM       93 °F     41 °F    16 °%  ...    0 °mph  27.55 °in  0.0 °in      Fair      
2    2:53 AM       90 °F     43 °F    19 °%  ...    0 °mph  27.55 °in  0.0 °in      Fair      
3    3:53 AM       86 °F     41 °F    20 °%  ...    0 °mph  27.56 °in  0.0 °in      Fair      
4    4:53 AM       83 °F     42 °F    24 °%  ...    0 °mph  27.56 °in  0.0 °in      Fair      
5    5:53 AM       81 °F     43 °F    26 °%  ...    0 °mph  27.57 °in  0.0 °in      Fair      
6    6:53 AM       82 °F     42 °F    24 °%  ...    0 °mph  27.59 °in  0.0 °in      Fair      
7    7:53 AM       86 °F     43 °F    22 °%  ...    0 °mph  27.59 °in  0.0 °in      Fair      
8    8:53 AM       90 °F     41 °F    18 °%  ...    0 °mph  27.59 °in  0.0 °in      Fair      
9    9:53 AM       94 °F     41 °F    16 °%  ...    0 °mph  27.60 °in  0.0 °in      Fair      
10  10:53 AM       97 °F     38 °F    13 °%  ...    0 °mph  27.58 °in  0.0 °in      Fair      
11  11:53 AM      102 °F     17 °F     5 °%  ...    0 °mph  27.56 °in  0.0 °in      Fair      
12  12:53 PM      104 °F     13 °F     4 °%  ...   21 °mph  27.55 °in  0.0 °in      Fair      
13   1:53 PM      106 °F     10 °F     3 °%  ...   23 °mph  27.53 °in  0.0 °in      Fair      
14   2:53 PM      106 °F      0 °F     2 °%  ...    0 °mph  27.50 °in  0.0 °in      Fair      
15   3:53 PM      107 °F      3 °F     2 °%  ...   24 °mph  27.47 °in  0.0 °in      Fair      
16   4:53 PM      106 °F     -3 °F     2 °%  ...   25 °mph  27.46 °in  0.0 °in      Fair      
17   5:53 PM      105 °F     -5 °F     2 °%  ...   25 °mph  27.45 °in  0.0 °in      Fair      
18   6:53 PM      102 °F     -5 °F     2 °%  ...    0 °mph  27.45 °in  0.0 °in      Fair      
19   7:53 PM       99 °F      0 °F     2 °%  ...    0 °mph  27.47 °in  0.0 °in      Fair      
20   8:53 PM       97 °F      6 °F     3 °%  ...    0 °mph  27.49 °in  0.0 °in      Fair      
21   9:53 PM       95 °F      9 °F     4 °%  ...    0 °mph  27.50 °in  0.0 °in      Fair      
22  10:53 PM       90 °F     13 °F     6 °%  ...    0 °mph  27.50 °in  0.0 °in      Fair      
23  11:53 PM       87 °F     18 °F     8 °%  ...    0 °mph  27.49 °in  0.0 °in      Fair      

[24 rows x 10 columns]
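
The cells in this table are strings with the unit attached (for example "93 °F"), so they need cleaning before any numeric work. Here is a rough sketch using a regular expression; the helper name strip_unit is hypothetical, and it assumes every numeric cell starts with the number, as in the output above.

```python
import re

# An optional minus sign, digits, and an optional decimal part at the start.
_NUMBER = re.compile(r"^\s*(-?\d+(?:\.\d+)?)")


def strip_unit(cell):
    """Return the leading number of a cell such as '93 °F' as a float, else None."""
    match = _NUMBER.match(str(cell))
    return float(match.group(1)) if match else None


for cell in ("93 °F", "-3 °F", "27.56 °in", "16 °%", "Fair"):
    print(cell, "->", strip_unit(cell))
```

On the dataframe, this could be applied per column, for example df['Temperature'].map(strip_unit). Text columns such as Condition would come back as None and are better left untouched.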