Home > Enterprise >  How to scrape data from html table with python?
How to scrape data from html table with python?

Time:10-17

I am trying to scrape data from the html tables on this page and export it to a csv.

The only success i've had is with extracting the headers. I thought the problem might be with the page not fully loading before the data is scraped, hence my use of the the 'requests_html' library, but the issue still persists.

Here's the code i am using:

import requests
from bs4 import BeautifulSoup
import csv
from requests_html import HTMLSession

url = 'https://www.fidelitypensionmanagers.com/Home/PriceHistory'
s = HTMLSession()
r = s.get(url)
r.html.render(sleep=2)

soup = BeautifulSoup(r.content.decode('utf8'),"lxml")
table = soup.find("table",{"id":"fund-I"}) # to select the right table

# find all rows
rows = table.findAll('tr')

# strip the header from rows
headers = rows[0]
header_text = []

# add the table header text to array
for th in headers.findAll('th'):
    header_text.append(th.text)

# init row text array
row_text_array = []

# loop through rows and add row text to array
for row in rows[1:]:
    row_text = []
    # loop through the elements
    for row_element in row.findAll(['th', 'td']):
        # append the array with the elements inner text
        row_text.append(row_element.text.replace('\n', '').strip())
    # append the text array to the row text array
    row_text_array.append(row_text)

with open("out.csv", "w") as f:
    wr = csv.writer(f)
    wr.writerow(header_text)
    for row_text_single in row_text_array:
        wr.writerow(row_text_single) 

Any help would be grateful, thanks

CodePudding user response:

Your assumption is correct, the contents are only reloaded when the corresponding areas are opened, so you should also start at this point and retrieve the data via these separate requests.

Simply change the value of id parameter for https://www.fidelitypensionmanagers.com/Services/GetPriceHistory?id=1&format=json

You can get the ids from the onclick:

<div  id="accordionOne">
    <div  onclick="getUnitPrices(12)"></div>
    <div  onclick="getUnitPrices(1)"></div>
    <div  onclick="getUnitPrices(13)"></div>
    <div  onclick="getUnitPrices(2)"></div>
    <div  onclick="getUnitPrices(9)"></div>

Example

simplest approach would be to use pandas in my opinion:

import pandas as pd
pd.read_json('https://www.fidelitypensionmanagers.com/Services/GetPriceHistory?id=1&format=json').to_csv('myfile.csv', index=False)

Output

SCHEME_ID,VALUATION_DATE,BID_PRICE_BF,OFFER_PRICE_BF,TOTAL_UNITS,ASSET_VALUE,STOCKS_VALUE,UNQUOTED_VALUE,UNQUOTED_APPR,UNINVESTED_CASH,UNDISTR_INCOME,MM_VALUE,STAMP_DUTIES,NSE_CSCS,BROKERAGE_FEES,TOTAL_BID_VALUE,TOTAL_OFFER_VALUE,BID_PRICE,OFFER_PRICE,APPR_TO_DATE,ID,PRICE_MOVEMENT,PRICE_CHANGE
1,/Date(1665644400000)/,0,0,22197476839.3793,93586560820.23,6562586060.8,0,0,395651878.65,0,9143340045.7,0,0,0,93586782102.51,93586782102.51,4.2161,4.2161,105797375.82,77484,,
1,/Date(1665558000000)/,0,0,22178176379.013,93502175535.94,6549468002.14,0,0,274281137.71,0,9179939634.67,0,0,0,93503191613.92,93503191613.92,4.216,4.216,92679317.16,77463,,
1,/Date(1665471600000)/,0,0,22173955797.4287,93469684659.81,6557409158.97,0,0,251510018.56,0,9177109809.92,0,0,0,93469875872.9,93469875872.9,4.2153,4.2153,100620473.99,77453,,
1,/Date(1665385200000)/,0,0,22164552155.3263,93418356788.97,6565624234.22,0,0,168155797.81,0,9174279985.2,0,0,0,93419154424.27,93419154424.27,4.2148,4.2148,108835549.24,77413,,
1,/Date(1665298800000)/,0,0,22164552155.3263,93396581110.67,6565624234.22,0,0,168155797.81,0,9171485395.41,0,0,0,93396989872.11,93396989872.11,4.2138,4.2138,108835549.24,77393,,
1,/Date(1665212400000)/,0,0,22164552155.3263,93374808849.27,6565624234.22,0,0,168155797.81,0,9168690805.6,0,0,0,93374825319.96,93374825319.96,4.2128,4.2128,108835549.24,77373,,
1,/Date(1665126000000)/,0,0,22164552155.3263,93353040003.64,6565624234.22,0,0,168155797.81,0,9165896215.82,0,0,0,93352660767.8,93352660767.8,4.2118,4.2118,108835549.24,77353,,
...
  • Related