I am trying to create a DataFrame with the CPI data scraped from a website with Beautiful Soup, sorted by year and Country. The DF would have the following structure:
country | year | q1 | q2 | q3 | q4 | year |
---|---|---|---|---|---|---|
Australia | 2022 | 123.9 | 126.1 | |||
Australia | 2021 | 117.9 | 118.8 | 119.7 | 117.2 | 119.4 |
... |
I am able to get all the needed data from the following script:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
list_countries=['australia','canada','brazil','italy','japan','mexico','new-zealand','france','germany','philippines','india','korea','russia','singapore','switzerland','uk','usa']
url = 'https://www.rateinflation.com/consumer-price-index/australia-historical-cpi/'
r = requests.get(url)
soup = bs(r.text, 'html.parser')
cpi_get = soup.find('table', class_='css-8rh80p eyyd7td0')
for year in cpi_get.find_all('tr'):
quarters = year.find_all('td')
print(quarters)
The problem is that the output has the following structure, for each iteration:
[<td>2021</td>, <td>117.9</td>, <td>118.8</td>, <td>119.7</td>, <td>121.3</td>, <td>119.4</td>]
Having the first td as the year, the second q1,... and consecutively up to year. As it does not have any class or extra info apart of the position, I don´t know how to create the table from that.
Does anyone know how to get the desired output from that?
CodePudding user response:
Try pd.read_html
:
import pandas as pd
url = "https://www.rateinflation.com/consumer-price-index/australia-historical-cpi/"
df = pd.read_html(url)[0]
df.rename(
columns={
"mar": "q1",
"jun": "q2",
"sep": "q3",
"dec": "q4",
},
inplace=True,
)
df["Country"] = "Australia"
# print first 10 rows:
print(df.head(10).to_markdown())
Prints:
Year | q1 | q2 | q3 | q4 | Annual | Country | |
---|---|---|---|---|---|---|---|
0 | 2022 | 123.9 | 126.1 | nan | nan | nan | Australia |
1 | 2021 | 117.9 | 118.8 | 119.7 | 121.3 | 119.4 | Australia |
2 | 2020 | 116.6 | 114.4 | 116.2 | 117.2 | 116.1 | Australia |
3 | 2019 | 114.1 | 114.8 | 115.4 | 116.2 | 115.1 | Australia |
4 | 2018 | 112.6 | 113 | 113.5 | 114.1 | 113.3 | Australia |
5 | 2017 | 110.5 | 110.7 | 111.4 | 112.1 | 111.2 | Australia |
6 | 2016 | 108.2 | 108.6 | 109.4 | 110 | 109.1 | Australia |
7 | 2015 | 106.8 | 107.5 | 108 | 108.4 | 107.7 | Australia |
8 | 2014 | 105.4 | 105.9 | 106.4 | 106.6 | 106.1 | Australia |
9 | 2013 | 102.4 | 102.8 | 104 | 104.8 | 103.5 | Australia |
EDIT:
list_countries = ["australia", "canada"]
url = "https://www.rateinflation.com/consumer-price-index/{}-historical-cpi/"
dfs = []
for country in list_countries:
df = pd.read_html(url.format(country))[0]
df["Country"] = country.capitalize()
dfs.append(df)
df = pd.concat(dfs)
print(df)