How to scrape historical data when date range is in a filter


I'm trying to scrape some historical data from the following URL: https://markets.ft.com/data/funds/tearsheet/historical?s=LU0526609390:EUR I would like to scrape all the available historical data; however, the website only shows the daily prices of the last 30 days. In order to go back further, I have to use the date filters, and I can only filter one year at a time.

I can easily scrape the information available on the first table using the following code for a couple of funds:

import requests
import pandas as pd
import datetime
import csv

urls = ['https://markets.ft.com/data/funds/tearsheet/historical?s=LU0526609390:EUR', 'https://markets.ft.com/data/funds/tearsheet/historical?s=IE00BHBX0Z19:EUR', 
'https://markets.ft.com/data/funds/tearsheet/historical?s=LU1076093779:EUR', 'https://markets.ft.com/data/funds/tearsheet/historical?s=LU1116896363:EUR']

# Change date format as there appears to be two versions of the date on the FT website for different sized browsers

def format_date(date):
    date = date.split(',')[-2][1:] + date.split(',')[-1]

    return pd.Series({'Date': date})

# Create list so all scraped data can be saved in one .csv file

dfs = []

# Create scraping loop for all defined urls

for url in urls:
    ISIN = url.split('=')[-1].replace(':', '_')
    html = requests.get(url).content
    df_list = pd.read_html(html)
    df = df_list[-1]
    df['Date'] = df['Date'].apply(format_date)
    print (df)
    dfs.append(df)
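
To actually get everything into one .csv at the end, something like this should work (a minimal sketch; the file name is just an example):

# Combine all scraped tables into one DataFrame and write them to a single .csv
combined = pd.concat(dfs, ignore_index=True)
combined.to_csv('ft_historical_prices.csv', index=False)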

However, I'm unable to make use of the filters on the webpage to obtain more historical data. I've tried many things, but I always get a different error message. Is there a way to do this? I'm really stuck, so any help would be greatly appreciated! Thanks

CodePudding user response:

The data is loaded from an API that lets you set date ranges, e.g. https://markets.ft.com/data/equities/ajax/get-historical-prices?startDate=2020/10/01&endDate=2021/10/01&symbol=535700333. This makes it possible to bypass the filter issue entirely:

import requests
import pandas as pd
from datetime import datetime
import time

#create list of annual dates for the past 100 years starting from today
datelist = pd.date_range(end=datetime.now(),periods=100,freq=pd.DateOffset(years=1))[::-1].strftime('%Y/%m/%d')
#create empty df
df = pd.DataFrame(None, columns=['Date','Open','High','Low','Close','Volume'])

#not sure when the historical data starts, so let's wrap it in a while loop
while True:
    for end, start in zip(datelist, datelist[1:]):
        try:
            r = requests.get(f'https://markets.ft.com/data/equities/ajax/get-historical-prices?startDate={start}&endDate={end}&symbol=535700333').json()
            df_temp = pd.read_html('<table>' + r['html'] + '</table>')[0]
            df_temp.columns=['Date','Open','High','Low','Close','Volume']
            df = pd.concat([df, df_temp])
            time.sleep(2)
        except:
            break
    break

Output:

Date Open High Low Close Volume
0 Friday, October 15, 2021Fri, Oct 15, 2021 78.8 78.8 78.8 78.8 ----
1 Thursday, October 14, 2021Thu, Oct 14, 2021 78.89 78.89 78.89 78.89 ----
2 Wednesday, October 13, 2021Wed, Oct 13, 2021 78.7 78.7 78.7 78.7 ----
3 Tuesday, October 12, 2021Tue, Oct 12, 2021 78.58 78.58 78.58 78.58 ----
4 Monday, October 11, 2021Mon, Oct 11, 2021 78.58 78.58 78.58 78.58 ----
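
The Date column holds both the long and the short date strings concatenated (the same quirk the question's format_date handles). A minimal sketch to clean it up afterwards, assuming the strings keep exactly the format shown above:

#keep only the short form ('Oct 15 2021') and parse it to a real datetime
df['Date'] = df['Date'].apply(lambda d: d.split(',')[-2][1:] + d.split(',')[-1])
df['Date'] = pd.to_datetime(df['Date'], format='%b %d %Y')
df = df.reset_index(drop=True)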
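If you need this for several funds, the same request can go in an outer loop. Note that the numeric symbol id (535700333 above, presumably the id behind the example tearsheet) is specific to each fund, so you would have to look up the corresponding id for each ISIN first. A sketch reusing datelist and the imports from above; the ids in the dict are placeholders, not real look-ups:

#hypothetical mapping from ISIN to the numeric FT symbol id used by the api
#the ids are placeholders - look up the real id for each fund first
symbols = {
    'LU0526609390:EUR': 535700333,  # id used in the example above
    'IE00BHBX0Z19:EUR': None,       # placeholder, unknown
}

all_frames = []
for isin, symbol in symbols.items():
    if symbol is None:
        continue
    for end, start in zip(datelist, datelist[1:]):
        try:
            r = requests.get(f'https://markets.ft.com/data/equities/ajax/get-historical-prices?startDate={start}&endDate={end}&symbol={symbol}').json()
            frame = pd.read_html('<table>' + r['html'] + '</table>')[0]
        except:
            break
        frame.columns = ['Date','Open','High','Low','Close','Volume']
        frame['ISIN'] = isin  #tag each row with the fund it belongs to
        all_frames.append(frame)
        time.sleep(2)

result = pd.concat(all_frames, ignore_index=True)
result.to_csv('ft_historical_all_funds.csv', index=False)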