I am trying to scrape data from the HTML tables on this page and export it to a CSV. The only success I've had is with extracting the headers. I thought the problem might be that the page wasn't fully loading before the data was scraped, hence my use of the 'requests_html' library, but the issue still persists.
Here's the code I am using:
import requests
from bs4 import BeautifulSoup
import csv
from requests_html import HTMLSession

url = 'https://www.fidelitypensionmanagers.com/Home/PriceHistory'
s = HTMLSession()
r = s.get(url)
r.html.render(sleep=2)

soup = BeautifulSoup(r.content.decode('utf8'), "lxml")
table = soup.find("table", {"id": "fund-I"})  # to select the right table

# find all rows
rows = table.findAll('tr')

# strip the header from rows
headers = rows[0]
header_text = []

# add the table header text to array
for th in headers.findAll('th'):
    header_text.append(th.text)

# init row text array
row_text_array = []

# loop through rows and add row text to array
for row in rows[1:]:
    row_text = []
    # loop through the elements
    for row_element in row.findAll(['th', 'td']):
        # append the array with the element's inner text
        row_text.append(row_element.text.replace('\n', '').strip())
    # append the text array to the row text array
    row_text_array.append(row_text)

with open("out.csv", "w") as f:
    wr = csv.writer(f)
    wr.writerow(header_text)
    for row_text_single in row_text_array:
        wr.writerow(row_text_single)
Any help would be appreciated, thanks.
CodePudding user response:
Your assumption is correct: the contents are only loaded when the corresponding accordion sections are opened, so you should start at this point and retrieve the data via those separate requests.
Simply change the value of the id parameter in https://www.fidelitypensionmanagers.com/Services/GetPriceHistory?id=1&format=json
You can get the ids from the onclick attributes:
<div id="accordionOne">
    <div onclick="getUnitPrices(12)"></div>
    <div onclick="getUnitPrices(1)"></div>
    <div onclick="getUnitPrices(13)"></div>
    <div onclick="getUnitPrices(2)"></div>
    <div onclick="getUnitPrices(9)"></div>
</div>
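For instance, the ids can be pulled out of those onclick calls with a regular expression and the JSON endpoint URLs built from them. A minimal sketch, using only the ids visible in the snippet above (in practice you would run the regex over the fetched page source):

```python
import re

# excerpt of the accordion markup shown above
html = '''
<div id="accordionOne">
    <div onclick="getUnitPrices(12)"></div>
    <div onclick="getUnitPrices(1)"></div>
    <div onclick="getUnitPrices(13)"></div>
    <div onclick="getUnitPrices(2)"></div>
    <div onclick="getUnitPrices(9)"></div>
</div>
'''

# pull every id out of the getUnitPrices(...) calls
ids = re.findall(r'getUnitPrices\((\d+)\)', html)
print(ids)  # ['12', '1', '13', '2', '9']

# build the JSON endpoint URL for each scheme
urls = [
    f'https://www.fidelitypensionmanagers.com/Services/GetPriceHistory?id={i}&format=json'
    for i in ids
]
```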
Example
The simplest approach, in my opinion, would be to use pandas:
import pandas as pd
pd.read_json('https://www.fidelitypensionmanagers.com/Services/GetPriceHistory?id=1&format=json').to_csv('myfile.csv', index=False)
Output
SCHEME_ID,VALUATION_DATE,BID_PRICE_BF,OFFER_PRICE_BF,TOTAL_UNITS,ASSET_VALUE,STOCKS_VALUE,UNQUOTED_VALUE,UNQUOTED_APPR,UNINVESTED_CASH,UNDISTR_INCOME,MM_VALUE,STAMP_DUTIES,NSE_CSCS,BROKERAGE_FEES,TOTAL_BID_VALUE,TOTAL_OFFER_VALUE,BID_PRICE,OFFER_PRICE,APPR_TO_DATE,ID,PRICE_MOVEMENT,PRICE_CHANGE
1,/Date(1665644400000)/,0,0,22197476839.3793,93586560820.23,6562586060.8,0,0,395651878.65,0,9143340045.7,0,0,0,93586782102.51,93586782102.51,4.2161,4.2161,105797375.82,77484,,
1,/Date(1665558000000)/,0,0,22178176379.013,93502175535.94,6549468002.14,0,0,274281137.71,0,9179939634.67,0,0,0,93503191613.92,93503191613.92,4.216,4.216,92679317.16,77463,,
1,/Date(1665471600000)/,0,0,22173955797.4287,93469684659.81,6557409158.97,0,0,251510018.56,0,9177109809.92,0,0,0,93469875872.9,93469875872.9,4.2153,4.2153,100620473.99,77453,,
1,/Date(1665385200000)/,0,0,22164552155.3263,93418356788.97,6565624234.22,0,0,168155797.81,0,9174279985.2,0,0,0,93419154424.27,93419154424.27,4.2148,4.2148,108835549.24,77413,,
1,/Date(1665298800000)/,0,0,22164552155.3263,93396581110.67,6565624234.22,0,0,168155797.81,0,9171485395.41,0,0,0,93396989872.11,93396989872.11,4.2138,4.2138,108835549.24,77393,,
1,/Date(1665212400000)/,0,0,22164552155.3263,93374808849.27,6565624234.22,0,0,168155797.81,0,9168690805.6,0,0,0,93374825319.96,93374825319.96,4.2128,4.2128,108835549.24,77373,,
1,/Date(1665126000000)/,0,0,22164552155.3263,93353040003.64,6565624234.22,0,0,168155797.81,0,9165896215.82,0,0,0,93352660767.8,93352660767.8,4.2118,4.2118,108835549.24,77353,,
...
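Note that VALUATION_DATE comes back in ASP.NET's /Date(ms)/ form, i.e. milliseconds since the Unix epoch. If you want readable dates in the CSV, a small conversion helper (a sketch; the function name is mine) could look like this:

```python
import re
from datetime import datetime, timezone

def parse_aspnet_date(value):
    """Convert an ASP.NET '/Date(1665644400000)/' string (epoch
    milliseconds) into a timezone-aware UTC datetime."""
    ms = int(re.search(r'/Date\((-?\d+)\)/', value).group(1))
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

print(parse_aspnet_date('/Date(1665644400000)/').date())  # 2022-10-13
```

With pandas you could apply it to the whole column, e.g. df['VALUATION_DATE'].map(parse_aspnet_date), before writing the CSV.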