Scraping a Table across Multiple Web Pages Using BeautifulSoup


Link to table: https://home.treasury.gov/resource-center/data-chart-center/interest-rates/TextView?type=daily_treasury_yield_curve&field_tdr_date_value=all&page=0

This table goes from page 0 to page 27.

I have successfully scraped the table into a pandas df for page 0:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://home.treasury.gov/resource-center/data-chart-center/interest-rates/TextView?type=daily_treasury_yield_curve&field_tdr_date_value=all&page=0'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')

# getting the table
table = soup.find('table', {'class': 'views-table views-view-table cols-20'})
headers = []
for i in table.find_all('th'):
    title = i.text.strip()
    headers.append(title)

df = pd.DataFrame(columns=headers)
for row in table.find_all('tr')[1:]:
    data = row.find_all('td')
    row_data = [td.text.strip() for td in data]
    length = len(df)
    df.loc[length] = row_data

Now I need to do the same for all the pages and store everything in a single df.

CodePudding user response:

You can use pandas.read_html to parse the tables into DataFrames and then concat them:

import pandas as pd

url = "https://home.treasury.gov/resource-center/data-chart-center/interest-rates/TextView?type=daily_treasury_yield_curve&field_tdr_date_value=all&page={}"

all_df = []
for page in range(0, 10):  # <-- pages run from 0 to 27; raise the stop value to fetch them all
    print("Getting page", page)
    all_df.append(pd.read_html(url.format(page))[0])

final_df = pd.concat(all_df).reset_index(drop=True)
print(final_df.tail(10).to_markdown(index=False))
| Date | 20 YR | 30 YR | Extrapolation Factor | 8 WEEKS BANK DISCOUNT | COUPON EQUIVALENT | 52 WEEKS BANK DISCOUNT | COUPON EQUIVALENT.1 | 1 Mo | 2 Mo | 3 Mo | 6 Mo | 1 Yr | 2 Yr | 3 Yr | 5 Yr | 7 Yr | 10 Yr | 20 Yr | 30 Yr |
|------|-------|-------|----------------------|-----------------------|-------------------|------------------------|---------------------|------|------|------|------|------|------|------|------|------|-------|-------|-------|
| 12/13/2001 | nan | nan | nan | nan | nan | nan | nan | 1.69 | nan | 1.69 | 1.78 | 2.2 | 3.09 | 3.62 | 4.4 | 4.9 | 5.13 | 5.81 | 5.53 |
| 12/14/2001 | nan | nan | nan | nan | nan | nan | nan | 1.7 | nan | 1.73 | 1.81 | 2.22 | 3.2 | 3.73 | 4.52 | 5.01 | 5.24 | 5.89 | 5.59 |
| 12/17/2001 | nan | nan | nan | nan | nan | nan | nan | 1.72 | nan | 1.74 | 1.84 | 2.24 | 3.21 | 3.74 | 4.54 | 5.03 | 5.26 | 5.91 | 5.61 |
| 12/18/2001 | nan | nan | nan | nan | nan | nan | nan | 1.72 | nan | 1.71 | 1.81 | 2.24 | 3.13 | 3.66 | 4.46 | 4.93 | 5.16 | 5.81 | 5.52 |
| 12/19/2001 | nan | nan | nan | nan | nan | nan | nan | 1.69 | nan | 1.69 | 1.8 | 2.23 | 3.11 | 3.63 | 4.38 | 4.84 | 5.08 | 5.73 | 5.45 |
| 12/20/2001 | nan | nan | nan | nan | nan | nan | nan | 1.67 | nan | 1.69 | 1.79 | 2.22 | 3.15 | 3.67 | 4.42 | 4.86 | 5.08 | 5.73 | 5.43 |
| 12/21/2001 | nan | nan | nan | nan | nan | nan | nan | 1.67 | nan | 1.71 | 1.81 | 2.23 | 3.17 | 3.69 | 4.45 | 4.89 | 5.12 | 5.76 | 5.45 |
| 12/24/2001 | nan | nan | nan | nan | nan | nan | nan | 1.66 | nan | 1.72 | 1.83 | 2.24 | 3.22 | 3.74 | 4.49 | 4.95 | 5.18 | 5.81 | 5.49 |
| 12/26/2001 | nan | nan | nan | nan | nan | nan | nan | 1.77 | nan | 1.75 | 1.87 | 2.34 | 3.26 | 3.8 | 4.55 | 5 | 5.22 | 5.84 | 5.52 |
| 12/27/2001 | nan | nan | nan | nan | nan | nan | nan | 1.75 | nan | 1.74 | 1.84 | 2.27 | 3.19 | 3.71 | 4.46 | 4.9 | 5.13 | 5.78 | 5.49 |
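
If you need the Date column as real datetimes for sorting or plotting, here is a minimal post-processing sketch, assuming the final_df built above and the MM/DD/YYYY format shown in the output:

import pandas as pd

# hypothetical follow-up: final_df comes from the snippet above
final_df["Date"] = pd.to_datetime(final_df["Date"], format="%m/%d/%Y")
final_df = final_df.sort_values("Date").reset_index(drop=True)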

CodePudding user response:

You can handle the pagination with a for loop. Note that the DataFrame has to be created once, before any rows are appended, and the row-appending has to happen inside the loop; otherwise only the last page survives:

url = 'https://home.treasury.gov/resource-center/data-chart-center/interest-rates/TextView?type=daily_treasury_yield_curve&field_tdr_date_value=all&page={p}'

df = None
for p in range(0, 28):  # pages run from 0 to 27
    page = requests.get(url.format(p=p))
    soup = BeautifulSoup(page.text, 'lxml')

    # getting the table
    table = soup.find('table', {'class': 'views-table views-view-table cols-20'})

    # read the headers and create the empty DataFrame on the first page only
    if df is None:
        headers = [th.text.strip() for th in table.find_all('th')]
        df = pd.DataFrame(columns=headers)

    # append every data row of the current page
    for row in table.find_all('tr')[1:]:
        data = row.find_all('td')
        row_data = [td.text.strip() for td in data]
        df.loc[len(df)] = row_data
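
Appending with df.loc re-allocates the DataFrame on every row and gets slow over 28 pages. A faster variant of the same loop, a sketch under the same assumptions about the page structure, collects the rows in a plain list and builds the frame once at the end:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://home.treasury.gov/resource-center/data-chart-center/interest-rates/TextView?type=daily_treasury_yield_curve&field_tdr_date_value=all&page={p}'

rows = []
headers = None
for p in range(0, 28):
    soup = BeautifulSoup(requests.get(url.format(p=p)).text, 'lxml')
    table = soup.find('table', {'class': 'views-table views-view-table cols-20'})
    if headers is None:
        # read the column names once, from the first page
        headers = [th.text.strip() for th in table.find_all('th')]
    for tr in table.find_all('tr')[1:]:
        rows.append([td.text.strip() for td in tr.find_all('td')])

# build the DataFrame in one shot from the accumulated rows
df = pd.DataFrame(rows, columns=headers)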