How to scrape using multiple urls after extracting a specific html value using Beautiful Soup


I'm trying to scrape daily fund prices from the FT website. The URL is of the following type: https://markets.ft.com/data/funds/tearsheet/historical?s=LU0526609390:EUR, where LU0526609390 is the fund's ISIN. The issue is that the page only shows the last 30 daily prices, and accessing more requires going through a date filter.

However, the data is loaded from an API that lets you set date ranges. For fund LU0526609390, for example, I can use the following URL: https://markets.ft.com/data/equities/ajax/get-historical-prices?startDate=2020/10/01&endDate=2021/10/01&symbol=535700333. This makes it possible to bypass the filter issue entirely.
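
For illustration, a single request to that endpoint looks like this; the response is JSON, and (as used in the loop below) the table rows come back as an HTML fragment under the 'html' key, without enclosing <table> tags:

import requests

# One-off request to the historical-prices endpoint (illustration only)
r = requests.get('https://markets.ft.com/data/equities/ajax/get-historical-prices?startDate=2020/10/01&endDate=2021/10/01&symbol=535700333').json()

# The price rows arrive as an HTML fragment without surrounding <table> tags
print(r['html'][:200])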

From the original fund URL I want to extract the symbol (xid = 535700333), then request all the available daily prices, and finally export the information to a CSV file named after the ISIN.
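
For reference, here is the extraction step in isolation. The xid sits in the JSON of the data-mod-config attribute on the 'add to watchlist' section of the tearsheet page (a minimal sketch; it assumes that section is present):

from bs4 import BeautifulSoup
import requests
import json

url = 'https://markets.ft.com/data/funds/tearsheet/historical?s=LU0526609390:EUR'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# The watchlist section carries a data-mod-config attribute whose JSON holds the xid
section = soup.find('section', {'class': 'mod-tearsheet-add-to-watchlist'})
xid = json.loads(section['data-mod-config'])['xid']   # e.g. '535700333'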

I have the following code using 4 fund URLs as an example:

from bs4 import BeautifulSoup
import requests
import json
import time
import pandas as pd
from datetime import datetime

#Create url list
urls = ['https://markets.ft.com/data/funds/tearsheet/historical?s=LU0526609390:EUR', 'https://markets.ft.com/data/funds/tearsheet/historical?s=IE00BHBX0Z19:EUR', 
'https://markets.ft.com/data/funds/tearsheet/historical?s=LU1076093779:EUR', 'https://markets.ft.com/data/funds/tearsheet/historical?s=LU1116896363:EUR']

#create list of annual dates for the past 100 years starting from today
datelist = pd.date_range(end=datetime.now(),periods=100,freq=pd.DateOffset(years=1))[::-1].strftime('%Y/%m/%d')

#Build Dataframe
df = pd.DataFrame(None, columns=['Date','Open','High','Low','Close','Volume'])

# Change date format as there appears to be two versions of the date on the FT website for different sized browsers
def format_date(date):
    date = date.split(',')[-2][1:] + date.split(',')[-1]

    return pd.Series({'Date': date})

# Build the scraping loop
for url in urls:
    ISIN = url.split('=')[-1].replace(':', '_')
    ISIN = ISIN[:-4]
    # Extract HTML element (symbol) from original fund url 
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    elemList = soup.find_all('section', {'class':'mod-tearsheet-add-to-watchlist'})
    for elem in elemList:
        elemID = elem.get('class')
        elemName = elem.get('data-mod-config')
        if elemName is None:
            pass
        elif 'xid' in elemName:
            data = json.loads(elemName)
            val1 = data['xid']
            # We have now extracted the fund xid so we can start the request loop using the API.
            while True:
                for end, start in zip(datelist, datelist[1:]):
                    try:
                        r = requests.get(f'https://markets.ft.com/data/equities/ajax/get-historical-prices?startDate={start}&endDate={end}&symbol={val1}').json()
                        df_temp = pd.read_html('<table>' + r['html'] + '</table>')[0]
                        df_temp.columns = ['Date','Open','High','Low','Close','Volume']
                        df['Date'] = df['Date'].apply(format_date)
                        df.to_csv(r'/Users//' + ISIN + '.csv', index=False)
                    except:
                        break
                break

However, I keep getting the same ValueError: No tables found matching pattern '.+' error message. I'm new to Python, so any help as to why I'm getting this error would be greatly appreciated! Thanks

CodePudding user response:

I don't get the error; it runs just fine on my end. The only thing I'd suggest is adding a headers parameter with a user-agent, since you may not be getting a 200 response. It may also help to put some print statements in your process so that you can debug and see where the code trips up. Try this and see if it helps:

from bs4 import BeautifulSoup
import requests
import json
import time
import pandas as pd
from datetime import datetime

#Create url list
urls = ['https://markets.ft.com/data/funds/tearsheet/historical?s=LU0526609390:EUR', 'https://markets.ft.com/data/funds/tearsheet/historical?s=IE00BHBX0Z19:EUR', 
'https://markets.ft.com/data/funds/tearsheet/historical?s=LU1076093779:EUR', 'https://markets.ft.com/data/funds/tearsheet/historical?s=LU1116896363:EUR']

#create list of annual dates for the past 100 years starting from today
datelist = pd.date_range(end=datetime.now(),periods=100,freq=pd.DateOffset(years=1))[::-1].strftime('%Y/%m/%d')

#Build Dataframe
df = pd.DataFrame(None, columns=['Date','Open','High','Low','Close','Volume'])

# Change date format as there appears to be two versions of the date on the FT website for different sized browsers
def format_date(date):
    date = date.split(',')[-2][1:] + date.split(',')[-1]

    return pd.Series({'Date': date})

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'}

# Build the scraping loop
for url in urls:
    ISIN = url.split('=')[-1].replace(':', '_')
    ISIN = ISIN[:-4]
    # Extract HTML element (symbol) from original fund url 
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    elemList = soup.find_all('section', {'class':'mod-tearsheet-add-to-watchlist'})
    for elem in elemList:
        elemID = elem.get('class')
        elemName = elem.get('data-mod-config')
        if elemName is None:
            pass
        elif 'xid' in elemName:
            data = json.loads(elemName)
            val1 = data['xid']
            # We have now extracted the fund xid so we can start the request loop using the API.
            while True:
                for end, start in zip(datelist, datelist[1:]):
                    try:
                        r = requests.get(f'https://markets.ft.com/data/equities/ajax/get-historical-prices?startDate={start}&endDate={end}&symbol={val1}', headers=headers).json()
                        df_temp = pd.read_html('<table>' + r['html'] + '</table>')[0]
                        df_temp.columns = ['Date','Open','High','Low','Close','Volume']
                        df['Date'] = df['Date'].apply(format_date)
                        df.to_csv(r'/Users//' + ISIN + '.csv', index=False)
                    except:
                        break
                break
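
If it still fails, a quick sanity check is to print the status code and the raw payload for a single request before handing it to pd.read_html. A sketch, using the example fund from the question:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'}
resp = requests.get('https://markets.ft.com/data/equities/ajax/get-historical-prices?startDate=2020/10/01&endDate=2021/10/01&symbol=535700333', headers=headers)

print(resp.status_code)                         # anything other than 200 is a red flag
print(repr(resp.json().get('html', ''))[:200])  # an empty 'html' fragment means no rows came back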