Scraping a Table across Multiple Web Pages Using BeautifulSoup


Link to table: https://home.treasury.gov/resource-center/data-chart-center/interest-rates/TextView?type=daily_treasury_yield_curve&field_tdr_date_value=all&page=0

This table goes from page 0 to page 27.

I have successfully scraped the table into a pandas df for page 0:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://home.treasury.gov/resource-center/data-chart-center/interest-rates/TextView?type=daily_treasury_yield_curve&field_tdr_date_value=all&page=0'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')

# getting the table
table = soup.find('table', {'class': 'views-table views-view-table cols-20'})
headers = []
for i in table.find_all('th'):
    title = i.text.strip()
    headers.append(title)

df = pd.DataFrame(columns=headers)
for row in table.find_all('tr')[1:]:
    data = row.find_all('td')
    row_data = [td.text.strip() for td in data]
    length = len(df)
    df.loc[length] = row_data

Now I need to do the same for all the pages and store everything in a single df.

CodePudding user response:

You can use pandas.read_html to parse the tables into DataFrames and then concat them:

import pandas as pd

url = "https://home.treasury.gov/resource-center/data-chart-center/interest-rates/TextView?type=daily_treasury_yield_curve&field_tdr_date_value=all&page={}"

all_df = []
for page in range(0, 10):  # <-- pages run from 0 to 27; raise the stop value to fetch them all
    print("Getting page", page)
    all_df.append(pd.read_html(url.format(page))[0])

final_df = pd.concat(all_df).reset_index(drop=True)
print(final_df.tail(10).to_markdown(index=False))
| Date | 20 YR | 30 YR | Extrapolation Factor | 8 WEEKS BANK DISCOUNT | COUPON EQUIVALENT | 52 WEEKS BANK DISCOUNT | COUPON EQUIVALENT.1 | 1 Mo | 2 Mo | 3 Mo | 6 Mo | 1 Yr | 2 Yr | 3 Yr | 5 Yr | 7 Yr | 10 Yr | 20 Yr | 30 Yr |
|------|-------|-------|----------------------|-----------------------|-------------------|------------------------|---------------------|------|------|------|------|------|------|------|------|------|-------|-------|-------|
| 12/13/2001 | nan | nan | nan | nan | nan | nan | nan | 1.69 | nan | 1.69 | 1.78 | 2.2 | 3.09 | 3.62 | 4.4 | 4.9 | 5.13 | 5.81 | 5.53 |
| 12/14/2001 | nan | nan | nan | nan | nan | nan | nan | 1.7 | nan | 1.73 | 1.81 | 2.22 | 3.2 | 3.73 | 4.52 | 5.01 | 5.24 | 5.89 | 5.59 |
| 12/17/2001 | nan | nan | nan | nan | nan | nan | nan | 1.72 | nan | 1.74 | 1.84 | 2.24 | 3.21 | 3.74 | 4.54 | 5.03 | 5.26 | 5.91 | 5.61 |
| 12/18/2001 | nan | nan | nan | nan | nan | nan | nan | 1.72 | nan | 1.71 | 1.81 | 2.24 | 3.13 | 3.66 | 4.46 | 4.93 | 5.16 | 5.81 | 5.52 |
| 12/19/2001 | nan | nan | nan | nan | nan | nan | nan | 1.69 | nan | 1.69 | 1.8 | 2.23 | 3.11 | 3.63 | 4.38 | 4.84 | 5.08 | 5.73 | 5.45 |
| 12/20/2001 | nan | nan | nan | nan | nan | nan | nan | 1.67 | nan | 1.69 | 1.79 | 2.22 | 3.15 | 3.67 | 4.42 | 4.86 | 5.08 | 5.73 | 5.43 |
| 12/21/2001 | nan | nan | nan | nan | nan | nan | nan | 1.67 | nan | 1.71 | 1.81 | 2.23 | 3.17 | 3.69 | 4.45 | 4.89 | 5.12 | 5.76 | 5.45 |
| 12/24/2001 | nan | nan | nan | nan | nan | nan | nan | 1.66 | nan | 1.72 | 1.83 | 2.24 | 3.22 | 3.74 | 4.49 | 4.95 | 5.18 | 5.81 | 5.49 |
| 12/26/2001 | nan | nan | nan | nan | nan | nan | nan | 1.77 | nan | 1.75 | 1.87 | 2.34 | 3.26 | 3.8 | 4.55 | 5 | 5.22 | 5.84 | 5.52 |
| 12/27/2001 | nan | nan | nan | nan | nan | nan | nan | 1.75 | nan | 1.74 | 1.84 | 2.27 | 3.19 | 3.71 | 4.46 | 4.9 | 5.13 | 5.78 | 5.49 |
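
If you need the Date column as real datetimes for sorting or plotting, here is a minimal post-processing sketch, assuming the final_df built above and the MM/DD/YYYY format shown in the output:

import pandas as pd

# hypothetical follow-up: final_df comes from the snippet above
final_df["Date"] = pd.to_datetime(final_df["Date"], format="%m/%d/%Y")
final_df = final_df.sort_values("Date").reset_index(drop=True)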

CodePudding user response:

You can handle the pagination with a for loop. Note that the DataFrame has to be created once, before any rows are appended, and the row-appending has to happen inside the loop; otherwise only the last page survives:

url = 'https://home.treasury.gov/resource-center/data-chart-center/interest-rates/TextView?type=daily_treasury_yield_curve&field_tdr_date_value=all&page={p}'

df = None
for p in range(0, 28):  # pages run from 0 to 27
    page = requests.get(url.format(p=p))
    soup = BeautifulSoup(page.text, 'lxml')

    # getting the table
    table = soup.find('table', {'class': 'views-table views-view-table cols-20'})

    # read the headers and create the empty DataFrame on the first page only
    if df is None:
        headers = [th.text.strip() for th in table.find_all('th')]
        df = pd.DataFrame(columns=headers)

    # append every data row of the current page
    for row in table.find_all('tr')[1:]:
        data = row.find_all('td')
        row_data = [td.text.strip() for td in data]
        df.loc[len(df)] = row_data
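
Appending with df.loc re-allocates the DataFrame on every row and gets slow over 28 pages. A faster variant of the same loop, a sketch under the same assumptions about the page structure, collects the rows in a plain list and builds the frame once at the end:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://home.treasury.gov/resource-center/data-chart-center/interest-rates/TextView?type=daily_treasury_yield_curve&field_tdr_date_value=all&page={p}'

rows = []
headers = None
for p in range(0, 28):
    soup = BeautifulSoup(requests.get(url.format(p=p)).text, 'lxml')
    table = soup.find('table', {'class': 'views-table views-view-table cols-20'})
    if headers is None:
        # read the column names once, from the first page
        headers = [th.text.strip() for th in table.find_all('th')]
    for tr in table.find_all('tr')[1:]:
        rows.append([td.text.strip() for td in tr.find_all('td')])

# build the DataFrame in one shot from the accumulated rows
df = pd.DataFrame(rows, columns=headers)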