How to find an HTML table with a different tab using beautiful soup

Time:11-24

I'm trying to scrape some Assets & Holdings information from the FT website for a number of funds (e.g. https://markets.ft.com/data/funds/tearsheet/holdings?s=LU1076093779:EUR). I can select the first table with no issues, but the second section consists of two tables under different tabs: 'Sectors' and 'Regions'. When I try to select the second table with table2 = soup.find_all('table')[1], I get either the table under the 'Sectors' tab or the one under the 'Regions' tab, but not both. Is there a way to select both tables?
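(For context: the tab switching on the page happens client-side, so both panes are already present in the static HTML that requests downloads, and find_all('table') returns them all. A minimal sketch with made-up markup standing in for the FT page:)

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the FT tearsheet: both tab panes are
# plain <table> elements that JavaScript merely hides or shows, so both
# appear in the raw HTML.
html = """
<div data-tab="Sectors"><table><tr><td>Technology</td></tr></table></div>
<div data-tab="Regions"><table><tr><td>Eurozone</td></tr></table></div>
"""

soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table')  # returns the tables from both tabs
```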

My code is as follows:

import requests
import pandas as pd
from bs4 import BeautifulSoup

List = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR', 'LU1076093779:EUR', 'LU1116896363:EUR']

df = pd.DataFrame(List, columns=['List'])
urls = 'https://markets.ft.com/data/funds/tearsheet/holdings?s=' + df['List']

dfs =[]
for url in urls:
    ISIN = url.split('=')[-1].replace(':', '_')
    ISIN = ISIN[:-4]
    r = requests.get(url).content
    soup = BeautifulSoup(r, 'html.parser')
    try:
        table1 = soup.find_all('table')[0]
        table2 = soup.find_all('table')[1]
    except Exception:
        continue   
    df1 = pd.read_html(str(table1), index_col=0)[0]
    df2 = pd.read_html(str(table2), index_col=0)[0]
    del df2['Category average']
    del df1['% Short']
    del df1['% Long']
    df1 = df1.rename(columns={'% Net assets': ISIN})
    df2 = df2.rename(columns={'% Net assets': ISIN})
    df = df1.append(df2)
    print(df)

My required output for fund - LU1076093779:

                      LU1076093779
Non-UK stock                 94.76%
Cash                          2.21%
UK stock                      3.03%
UK bond                       0.00%
Non-UK bond                   0.00%
Other                         0.00%
Financial Services           16.96%
Industrials                  14.22%
Consumer Cyclical            13.80%
Technology                   13.65%
Healthcare                   11.08%
Consumer Defensive            8.09%
Communication Services        7.20%
Basic Materials               6.04%
Utilities                     3.62%
Other                         3.10%
Americas                      1.37% 
United States                 0.79% 
Latin America                 0.58% 
Greater Asia                  0.00% 
Greater Europe                96.02%    
Eurozone                      91.79%    
United Kingdom                3.03% 
Europe - ex Euro              1.20% 

but at the moment I just get the following:

                       LU1076093779
Non-UK stock                 94.76%
Cash                          2.21%
UK stock                      3.03%
UK bond                       0.00%
Non-UK bond                   0.00%
Other                         0.00%
Financial Services           16.96%
Industrials                  14.22%
Consumer Cyclical            13.80%
Technology                   13.65%
Healthcare                   11.08%
Consumer Defensive            8.09%
Communication Services        7.20%
Basic Materials               6.04%
Utilities                     3.62%
Other                         3.10%

CodePudding user response:

It's all there; you just need to put the tables together after some data manipulation. Note, though, that some of those links don't have all three tables. I also changed how you iterate through the list: there's no need to create a DataFrame just to build the URLs.

Also, be careful with your variables: assigning to df inside the loop overwrites it on every iteration. I think you meant dfs = df1.append(df2), but I've adjusted the code a bit.
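As a side note on the variable issue: DataFrame.append was deprecated and has been removed in pandas 2.0, so the list-then-concat pattern used below is also the future-proof way to stack the per-table frames. A minimal, self-contained sketch with made-up data:

```python
import pandas as pd

# Made-up stand-ins for the per-table frames scraped from one fund page
frames = []
for data in ({'Type': ['Non-UK stock'], 'LU1076093779': ['94.76%']},
             {'Type': ['Financial Services'], 'LU1076093779': ['16.96%']}):
    frames.append(pd.DataFrame(data))

# Reassigning a single `df` each iteration would keep only the last table;
# collecting the frames in a list and concatenating once keeps them all.
result = pd.concat(frames, ignore_index=True)
```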

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Define all URLs required for data scraping from the FT website - if a new
# fund is added, simply append the appropriate fund ID to the list
id_list = ['LU1076093779:EUR', 'LU0526609390:EUR', 'IE00BHBX0Z19:EUR', 'LU1076093779:EUR', 'LU1116896363:EUR']
urls = ['https://markets.ft.com/data/funds/tearsheet/holdings?s=' + x for x in id_list]


for url in urls:
    print(url)
    ISIN = url.split('=')[-1].replace(':', '_')
    ISIN = ISIN[:-4]
    df_list = []
    r = requests.get(url).content
    soup = BeautifulSoup(r, 'html.parser')
    
    # FT's colspan attributes contain a '%' sign, which pd.read_html
    # cannot cast to an integer, so strip it before parsing
    all_colspan = soup.find_all(attrs={'colspan':True})
    for colspan in all_colspan:
        colspan.attrs['colspan'] = colspan.attrs['colspan'].replace('%', '')
        
    # parse every table on the page in one go
    dfs = pd.read_html(str(soup))
    for df in dfs:
        cols = df.columns
        drop_cols = ['Category average', '% Short', '% Long']
        # keep only the holdings tables and give them a common layout
        if 'Type' in cols or 'Sector' in cols:
            df = df.rename(columns={'% Net assets': ISIN,
                                    'Sector': 'Type'})
            # drop only the columns that are actually present in this table
            df = df.drop([x for x in drop_cols if x in df.columns], axis=1)
            df_list.append(df)
            
    result = pd.concat(df_list)
    print(result)
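To see why the colspan cleanup is needed: pd.read_html converts colspan/rowspan attributes to integers while expanding merged cells, so a value containing '%' makes it raise ValueError (at least in the pandas versions I've tested). A self-contained sketch with a made-up fragment shaped like the FT markup:

```python
import io

import pandas as pd
from bs4 import BeautifulSoup

# Made-up fragment shaped like the FT markup: the header cell's colspan
# contains a '%' sign, which int() cannot parse.
html = """<table>
<tr><th colspan="2%">Sectors</th></tr>
<tr><th>Sector</th><th>% Net assets</th></tr>
<tr><td>Technology</td><td>13.65%</td></tr>
</table>"""

# Without the cleanup, pd.read_html fails on int("2%")
try:
    pd.read_html(io.StringIO(html))
    raw_parse_ok = True
except ValueError:
    raw_parse_ok = False

# Strip the '%' from every colspan, as the loop above does, then parse
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all(attrs={'colspan': True}):
    tag['colspan'] = tag['colspan'].replace('%', '')

df = pd.read_html(io.StringIO(str(soup)))[0]
```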