I'm trying to scrape some historical data on the following url: https://markets.ft.com/data/funds/tearsheet/historical?s=LU0526609390:EUR I would like to scrape all the available historical data, however, the website only allows me to scrape the daily prices of the last 30 days. In order to go back further, I have to make use of the filters and can only filter one year at a time.
I can easily scrape the information available on the first table using the following code for a couple of funds:
import pandas as pd
import datetime
import csv
urls = ['https://markets.ft.com/data/funds/tearsheet/historical?s=LU0526609390:EUR', 'https://markets.ft.com/data/funds/tearsheet/historical?s=IE00BHBX0Z19:EUR',
'https://markets.ft.com/data/funds/tearsheet/historical?s=LU1076093779:EUR', 'https://markets.ft.com/data/funds/tearsheet/historical?s=LU1116896363:EUR']
# Change date format as there appears to be two versions of the date on the FT website for different sized browsers
def format_date(date):
date = date.split(',')[-2][1:] date.split(',')[-1]
return pd.Series({'Date': date})
# Create list to allow all scrapping data to be saved in one .csv file
dfs = []
# Create scrapping loop for all defined urls
for url in urls:
ISIN = url.split('=')[-1].replace(':', '_')
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
df['Date'] = df['Date'].apply(format_date)
print (df)
dfs.append(df)
However, I'm unable to make use of the filters on the webpage to obtain more historical data? I've tried many things but always get a different error message. Is there a way to do this? I'm really stuck so any help would be greatly appreciated! Thanks
CodePudding user response:
The data is loaded from an api that allows to set date ranges, eg https://markets.ft.com/data/equities/ajax/get-historical-prices?startDate=2020/10/01&endDate=2021/10/01&symbol=535700333
. This makes it possible to skip the filters issue:
import requests
import pandas as pd
from datetime import datetime
import time
#create list of annual dates for the past 100 years starting from today
datelist = pd.date_range(end=datetime.now(),periods=100,freq=pd.DateOffset(years=1))[::-1].strftime('%Y/%m/%d')
#create empty df
df = pd.DataFrame(None, columns=['Date','Open','High','Low','Close','Volume'])
#not sure when the historical data starts, so let's wrap it in a while loop
while True:
for end, start in zip(datelist, datelist[1:]):
try:
r = requests.get(f'https://markets.ft.com/data/equities/ajax/get-historical-prices?startDate={start}&endDate={end}&symbol=535700333').json()
df_temp = pd.read_html('<table>' r['html'] '</table>')[0]
df_temp.columns=['Date','Open','High','Low','Close','Volume']
df = df.append(df_temp)
time.sleep(2)
except:
break
break
Output:
Date | Open | High | Low | Close | Volume | |
---|---|---|---|---|---|---|
0 | Friday, October 15, 2021Fri, Oct 15, 2021 | 78.8 | 78.8 | 78.8 | 78.8 | ---- |
1 | Thursday, October 14, 2021Thu, Oct 14, 2021 | 78.89 | 78.89 | 78.89 | 78.89 | ---- |
2 | Wednesday, October 13, 2021Wed, Oct 13, 2021 | 78.7 | 78.7 | 78.7 | 78.7 | ---- |
3 | Tuesday, October 12, 2021Tue, Oct 12, 2021 | 78.58 | 78.58 | 78.58 | 78.58 | ---- |
4 | Monday, October 11, 2021Mon, Oct 11, 2021 | 78.58 | 78.58 | 78.58 | 78.58 | ---- |