Home > Enterprise >  Create a DataFrame with ordered data from a scraped table with Beautiful Soup
Create a DataFrame with ordered data from a scraped table with Beautiful Soup

Time:09-18

I am trying to create a DataFrame with the CPI data scraped from a website with Beautiful Soup, sorted by year and Country. The DF would have the following structure:

country year q1 q2 q3 q4 year
Australia 2022 123.9 126.1
Australia 2021 117.9 118.8 119.7 117.2 119.4
...

I am able to get all the needed data from the following script:

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs


list_countries=['australia','canada','brazil','italy','japan','mexico','new-zealand','france','germany','philippines','india','korea','russia','singapore','switzerland','uk','usa']

url = 'https://www.rateinflation.com/consumer-price-index/australia-historical-cpi/'
r = requests.get(url)
soup = bs(r.text, 'html.parser')
cpi_get = soup.find('table', class_='css-8rh80p eyyd7td0')

for year in cpi_get.find_all('tr'):
    quarters = year.find_all('td')
    print(quarters)

The problem is that the output has the following structure, for each iteration:

[<td>2021</td>, <td>117.9</td>, <td>118.8</td>, <td>119.7</td>, <td>121.3</td>, <td>119.4</td>]

Having the first td as the year, the second q1,... and consecutively up to year. As it does not have any class or extra info apart of the position, I don´t know how to create the table from that.

Does anyone know how to get the desired output from that?

CodePudding user response:

Try pd.read_html:

import pandas as pd

url = "https://www.rateinflation.com/consumer-price-index/australia-historical-cpi/"

df = pd.read_html(url)[0]
df.rename(
    columns={
        "mar": "q1",
        "jun": "q2",
        "sep": "q3",
        "dec": "q4",
    },
    inplace=True,
)
df["Country"] = "Australia"

# print first 10 rows:
print(df.head(10).to_markdown())

Prints:

Year q1 q2 q3 q4 Annual Country
0 2022 123.9 126.1 nan nan nan Australia
1 2021 117.9 118.8 119.7 121.3 119.4 Australia
2 2020 116.6 114.4 116.2 117.2 116.1 Australia
3 2019 114.1 114.8 115.4 116.2 115.1 Australia
4 2018 112.6 113 113.5 114.1 113.3 Australia
5 2017 110.5 110.7 111.4 112.1 111.2 Australia
6 2016 108.2 108.6 109.4 110 109.1 Australia
7 2015 106.8 107.5 108 108.4 107.7 Australia
8 2014 105.4 105.9 106.4 106.6 106.1 Australia
9 2013 102.4 102.8 104 104.8 103.5 Australia

EDIT:

list_countries = ["australia", "canada"]
url = "https://www.rateinflation.com/consumer-price-index/{}-historical-cpi/"

dfs = []
for country in list_countries:
    df = pd.read_html(url.format(country))[0]
    df["Country"] = country.capitalize()
    dfs.append(df)

df = pd.concat(dfs)
print(df)
  • Related