I'm trying to scrape some Assets & Holdings information on the FT website for a number of funds (i.e. https://markets.ft.com/data/funds/tearsheet/holdings?s=LU1076093779:EUR). I'm able to select the first table with no issues but the second table consists of two tables with different tabs: 'Sectors' and 'Regions'. When I try to select the second table with table2 = soup.find_all('table')[1]
, I get either the table under the 'Sectors' tab or the 'Regions' tab. Is there a way to select both tables?
My code is as follows:
import requests
import pandas as pd
from bs4 import BeautifulSoup
List = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR', 'LU1076093779:EUR', 'LU1116896363:EUR']
df = pd.DataFrame(List, columns=['List'])
urls = 'https://markets.ft.com/data/funds/tearsheet/holdings?s=' df['List']
dfs =[]
for url in urls:
ISIN = url.split('=')[-1].replace(':', '_')
ISIN = ISIN[:-4]
r = requests.get(url).content
soup = BeautifulSoup(r, 'html.parser')
try:
table1 = soup.find_all('table')[0]
table2 = soup.find_all('table')[1]
except Exception:
continue
df1 = pd.read_html(str(table1), index_col=0)[0]
df2 = pd.read_html(str(table2), index_col=0)[0]
del df2['Category average']
del df1['% Short']
del df1['% Long']
df1 = df1.rename(columns={'% Net assets': ISIN})
df2 = df2.rename(columns={'% Net assets': ISIN})
df = df1.append(df2)
print(df)
My required output for fund - LU1076093779:
LU1076093779
Non-UK stock 94.76%
Cash 2.21%
UK stock 3.03%
UK bond 0.00%
Non-UK bond 0.00%
Other 0.00%
Financial Services 16.96%
Industrials 14.22%
Consumer Cyclical 13.80%
Technology 13.65%
Healthcare 11.08%
Consumer Defensive 8.09%
Communication Services 7.20%
Basic Materials 6.04%
Utilities 3.62%
Other 3.10%
Americas 1.37%
United States 0.79%
Latin America 0.58%
Greater Asia 0.00%
Greater Europe 96.02%
Eurozone 91.79%
United Kingdom 3.03%
Europe - ex Euro 1.20%
but at the moment I just get the following:
LU1076093779
Non-UK stock 94.76%
Cash 2.21%
UK stock 3.03%
UK bond 0.00%
Non-UK bond 0.00%
Other 0.00%
Financial Services 16.96%
Industrials 14.22%
Consumer Cyclical 13.80%
Technology 13.65%
Healthcare 11.08%
Consumer Defensive 8.09%
Communication Services 7.20%
Basic Materials 6.04%
Utilities 3.62%
Other 3.10%
CodePudding user response:
It's all there just need to put the tables together after some data manipulation. Note though some of those links don't have all 3 tables. I also changed how you iterate through the list. No need to create a dataframe.
Also be careful with your variables. I think you meant dfs = df1.append(df2)
. But I adjusted the code a bit.
import requests
import pandas as pd
from bs4 import BeautifulSoup
# Define all urls required for data scrapping from the FT Website - if new fund is added simply add the appropriate Fund ID to the List
id_list = ['LU1076093779:EUR','LU0526609390:EUR', 'IE00BHBX0Z19:EUR', 'LU1076093779:EUR', 'LU1116896363:EUR']
urls = ['https://markets.ft.com/data/funds/tearsheet/holdings?s=' x for x in id_list]
for url in urls:
print(url)
ISIN = url.split('=')[-1].replace(':', '_')
ISIN = ISIN[:-4]
df_list = []
r = requests.get(url).content
soup = BeautifulSoup(r, 'html.parser')
all_colspan = soup.find_all(attrs={'colspan':True})
for colspan in all_colspan:
colspan.attrs['colspan'] = colspan.attrs['colspan'].replace('%', '')
dfs = pd.read_html(str(soup))
for df in dfs:
cols = df.columns
drop_cols = ['Category average', '% Short', '% Long']
if 'Type' in cols or 'Sector' in cols:
df = df.rename(columns={'% Net assets': ISIN,
'Sector':'Type'})
df = df.drop([x for x in drop_cols if x in df.columns], axis=1)
df_list.append(df)
result = pd.concat(df_list)
print(result)