I would like to parse the table "Table 1: Consumer Price Index, historical indices from 1924 (2015=100)" from here: https://www.ssb.no/en/priser-og-prisindekser/konsumpriser/statistikk/konsumprisindeksen
I am using Selenium to open the table that I want to parse (see code below), but the line with pd.read_html throws this error:
ImportError: html5lib not found, please install it
This happens even though I have installed html5lib (pip list shows version 1.1). How can I best parse the table?
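from time import sleep
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
options = Options()
url = "https://www.ssb.no/en/priser-og-prisindekser/konsumpriser/statistikk/konsumprisindeksen"
# mypath holds the path to the chromedriver binary
driver_no = webdriver.Chrome(options=options, executable_path=mypath)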
driver_no.get(url)
sleep(2)
WebDriverWait(driver_no, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="attachment-table-figure-1"]/button')))
elem = driver_no.find_element(By.XPATH, '//*[@id="attachment-table-figure-1"]/button')
sleep(2)
driver_no.execute_script("arguments[0].scrollIntoView(true);", elem)
sleep(2)
driver_no.find_element(By.XPATH, '//*[@id="attachment-table-figure-1"]/button').click()
df_list = pd.read_html(driver_no.page_source, "html_parser")
driver_no.quit()
CodePudding user response:
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--headless")
webdriver_service = Service("chromedriver/chromedriver")  # path to where you saved the chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
browser.get("https://www.ssb.no/en/priser-og-prisindekser/konsumpriser/statistikk/konsumprisindeksen")
# Parse the rendered page and pick the second <table> element
soup = BeautifulSoup(browser.page_source, 'html5lib')
table = soup.select('table')[1]
browser.quit()
# Collect the text of every header and data cell, row by row
final_list = []
for row in table.select('tr'):
    final_list.append([x.text for x in row.find_all(['td', 'th'])])
# The first row holds the column names, the remaining rows hold the data
final_df = pd.DataFrame(final_list[1:], columns=final_list[0])
final_df[:-2]  # drop the last two rows
This returns the actual table:
          Y-avg2    Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec
0    2022      .  117.8  119.1  119.8  121.2  121.5  122.6      .      .      .      .      .      .
1    2021  116.1  114.1  114.9  114.6  115.0  114.9  115.3  116.3  116.3  117.5  117.2  118.1  118.9
2    2020  112.2  111.3  111.2  111.2  111.7  111.9  112.1  112.9  112.5  112.9  113.2  112.4  112.9
3    2019  110.8  109.3  110.2  110.4  110.8  110.5  110.6  111.4  110.6  111.1  111.3  111.6  111.3
4    2018  108.4  106.0  107.0  107.3  107.7  107.8  108.5  109.3  108.9  109.5  109.3  109.8  109.8
..    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
89   1933    2.7    2.7    2.7    2.7    2.7    2.7    2.7    2.7    2.8    2.7    2.7    2.7    2.7
90   1932    2.8    2.8    2.8    2.8    2.8    2.8    2.8    2.8    2.8    2.8    2.8    2.8    2.8
91   1931    2.8    2.9    2.9    2.9    2.9    2.8    2.8    2.8    2.8    2.8    2.8    2.8    2.8
92   1930    3.0    3.0    3.0    3.0    3.0    3.0    3.0    3.0    3.0    3.0    2.9    2.9    2.9
93   1929    3.1    3.1    3.1    3.1    3.1    3.1    3.1    3.1    3.1    3.1    3.1    3.1    3.1
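Note that every cell scraped this way is a string, and months without a value show up as ".". Below is a minimal sketch of turning such cells into numbers, assuming "." and empty strings are the only missing-value markers. The helper names to_number and convert_row are hypothetical and not part of the code above.

```python
from typing import List, Optional


def to_number(cell: str) -> Optional[float]:
    """Convert a scraped cell to float; return None for empty or '.' cells."""
    text = cell.strip()
    if text in ("", "."):
        return None
    return float(text)


def convert_row(row: List[str]) -> List[Optional[float]]:
    """Apply to_number to every cell of one table row."""
    return [to_number(cell) for cell in row]


# First few cells of the 2022 row shown above: year, Y-avg, Jan, Feb
sample = ["2022", ".", "117.8", "119.1"]
converted = convert_row(sample)
print(converted)
```

If the data is already in a DataFrame, pd.to_numeric with errors="coerce" applied per column should give a similar result, with NaN in place of None.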
Regarding your 'html5lib' issue: without seeing your actual install or virtual environment, there is not much help I can offer. Try reinstalling it, or try installing it in a new virtual environment.
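One common cause of this kind of ImportError is that pip installed the package for a different Python interpreter than the one running the script. The sketch below is one way to check that, using only the standard library. The helper name importable is hypothetical, and the list of modules checked is just an assumption about what is relevant here.

```python
import sys
from importlib.util import find_spec
from typing import Dict, Iterable


def importable(names: Iterable[str]) -> Dict[str, bool]:
    """Map each top-level module name to whether this interpreter can find it."""
    return {name: find_spec(name) is not None for name in names}


# The interpreter running this script; compare it with the one behind `pip`
print("Interpreter:", sys.executable)

report = importable(["html5lib", "bs4", "lxml", "pandas"])
for name, found in report.items():
    print(f"{name}: {'found' if found else 'NOT found'}")
```

Run this with the same interpreter (or IDE configuration) that runs the scraping script. If html5lib is reported as not found, installing with python -m pip install html5lib from that same interpreter should put it in the right place. Separately, in pandas versions that accept positional arguments here, the second positional argument of pd.read_html is match rather than flavor, so "html_parser" is treated as a text pattern to search for; the parser is chosen with the flavor keyword, for example flavor="lxml" or flavor="bs4".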