Python webscraping: How to parse html table, selenium

Time:07-17

I would like to parse the table "Table 1: Consumer Price Index, historical indices from 1924 (2015=100)" from the Statistics Norway page whose URL appears in the code below.

I am using Selenium to open the table that I want to parse (see code below), but the line with pd.read_html raises the following error:

ImportError: html5lib not found, please install it

even though html5lib is installed (pip list shows version 1.1). What is the best way to parse the table?

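from time import sleep

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# mypath (used below) is the path to the chromedriver binary, defined elsewhere
options = Options()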

url = "https://www.ssb.no/en/priser-og-prisindekser/konsumpriser/statistikk/konsumprisindeksen"
driver_no = webdriver.Chrome(options=options, executable_path=mypath)

driver_no.get(url)
sleep(2)
WebDriverWait(driver_no, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="attachment-table-figure-1"]/button')))
elem = driver_no.find_element(By.XPATH, '//*[@id="attachment-table-figure-1"]/button')
sleep(2)
driver_no.execute_script("arguments[0].scrollIntoView(true);", elem)
sleep(2)
driver_no.find_element(By.XPATH, '//*[@id="attachment-table-figure-1"]/button').click()

df_list = pd.read_html(driver_no.page_source, "html_parser")
driver_no.quit()

Answer:

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--headless")

# Path to where you saved the chromedriver binary
webdriver_service = Service("chromedriver/chromedriver")
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)

browser.get("https://www.ssb.no/en/priser-og-prisindekser/konsumpriser/statistikk/konsumprisindeksen")
# "html5lib" must be importable here; the built-in "html.parser" is an alternative
soup = BeautifulSoup(browser.page_source, "html5lib")
# The second <table> on the page (index 1) is the target table
table = soup.select("table")[1]
browser.quit()

# Collect one list of cell texts per row, header (<th>) and data (<td>) cells alike
final_list = []
for row in table.select("tr"):
    final_list.append([x.text for x in row.find_all(["td", "th"])])

# The first row holds the column names, the remaining rows hold the data
final_df = pd.DataFrame(final_list[1:], columns=final_list[0])
# Show everything except the last two rows
final_df[:-2]

This returns the actual table:

           Y-avg2     Jan     Feb     Mar     Apr     May     Jun     Jul     Aug     Sep     Oct     Nov     Dec
0    2022       .   117.8   119.1   119.8   121.2   121.5   122.6       .       .       .       .       .       .
1    2021   116.1   114.1   114.9   114.6   115.0   114.9   115.3   116.3   116.3   117.5   117.2   118.1   118.9
2    2020   112.2   111.3   111.2   111.2   111.7   111.9   112.1   112.9   112.5   112.9   113.2   112.4   112.9
3    2019   110.8   109.3   110.2   110.4   110.8   110.5   110.6   111.4   110.6   111.1   111.3   111.6   111.3
4    2018   108.4   106.0   107.0   107.3   107.7   107.8   108.5   109.3   108.9   109.5   109.3   109.8   109.8
...   ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
89   1933     2.7     2.7     2.7     2.7     2.7     2.7     2.7     2.7     2.8     2.7     2.7     2.7     2.7
90   1932     2.8     2.8     2.8     2.8     2.8     2.8     2.8     2.8     2.8     2.8     2.8     2.8     2.8
91   1931     2.8     2.9     2.9     2.9     2.9     2.8     2.8     2.8     2.8     2.8     2.8     2.8     2.8
92   1930     3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0     2.9     2.9     2.9
93   1929     3.1     3.1     3.1     3.1     3.1     3.1     3.1     3.1     3.1     3.1     3.1     3.1     3.1
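
The row-by-row extraction above does not strictly need html5lib. As a rough sketch, assuming only the standard library is available, html.parser.HTMLParser can collect the same cell texts. The class name TableRows and the sample markup are made up for illustration (the markup stands in for browser.page_source), and nested tables are not handled:

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr>, grouped per <table> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table> seen so far
        self._row = None   # cells of the <tr> being read, else None
        self._cell = None  # text fragments of the <td>/<th> being read, else None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (for example whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


# Sample markup standing in for browser.page_source (an assumption, not the real page).
sample_html = """
<table>
  <tr><th></th><th>Y-avg</th><th>Jan</th></tr>
  <tr><th>2021</th><td>116.1</td><td>114.1</td></tr>
  <tr><th>2020</th><td>112.2</td><td>111.3</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_html)
header, *rows = parser.tables[0]
print(header)
print(rows)
```

The resulting header and rows can then be passed to pd.DataFrame(rows, columns=header), in the same way final_list is used above.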

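Regarding your html5lib issue: without seeing your actual install or virtual environment, there is not much I can say. This error often means that html5lib is installed for a different interpreter than the one running the script, so try reinstalling it, or install everything in a fresh virtual environment. Note also that the second positional argument of pd.read_html is match (a regular expression for the table text), not the parser; the parser is chosen with the flavor keyword, which accepts "lxml", "bs4", or "html5lib".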
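
One way to narrow this down is to ask the interpreter that actually runs the script whether it can see html5lib. The snippet below is only a sketch: the helper name describe_module is made up here, and it uses nothing beyond the standard library. Run it with the same interpreter (or in the same notebook kernel) as the failing script:

```python
import importlib.util
import sys


def describe_module(name: str) -> str:
    """Say whether the top-level module `name` is importable by this interpreter."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return f"{name}: NOT importable from {sys.executable}"
    return f"{name}: found at {spec.origin}"


# Parser-related packages that pd.read_html and BeautifulSoup may rely on.
for module_name in ("html5lib", "bs4", "lxml"):
    print(describe_module(module_name))
```

If html5lib is reported as not importable even though pip list shows it, pip is most likely installing into a different environment; running python -m pip install html5lib with the interpreter path printed above targets the right one.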
