How to use pd.read_html for scraping HTML tables with % values-CodePudding

I'm trying to scrape the 'Profile and investment' table from the following url: https://markets.ft.com/data/funds/tearsheet/summary?s=LU0526609390:EUR, using the following code:

import requests
import pandas as pd

# Define all urls required for data scraping from the FT Website - if new fund is added simply add the appropriate Fund ID to the List
List = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR', 'LU1076093779:EUR', 'LU1116896363:EUR']
df = pd.DataFrame(List, columns=['List'])
urls = 'https://markets.ft.com/data/funds/tearsheet/summary?s='  df['List']

for url in urls:
    r = requests.get(url).content
    df = pd.read_html(r)[0]
    print (df)

However, when I use the pd.read_html function, I get the following error code: ValueError: invalid literal for int() with base 10: '100%', because the table has entries in %. Is there a way to make pandas accept % values? I'm new to python and pandas so any help would be greatly appreciated!

My required output is to get a table with the following format:

    Fund_ID          Fund_type     Income_treatment     Morningstar category ......
LU0526609390:EUR        ...              ...                    ....
IE00BHBX0Z19:EUR        ...              ...                    ....
LU1076093779:EUR        ...              ...                    ....
LU1116896363:EUR        ...              ...                    ....

CodePudding user response：

The issue is the site uses the 'colspan' attribute and uses % instead of with an int. As AsishM mentions in the comments, these should be in the form of an int, and while some browsers will accommodate for that, pandas is specifically wanting it to be the appropriate syntax of

<td colspan="number">

Ways to approach this is:

Use BeautifulSoup to fix those attributes
Since it's not within the table you actually want to parse, use BeautifulSoup to grab that first table and then don't need to worry about it.
See if the table has a specific attribute and could add that to the .read_html() as a parameter so it grabs only that specific table.

I chose option 2 here:

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Define all urls required for data scrapping from the FT Website - if new fund is added simply add the appropriate Fund ID to the List
List = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR', 'LU1076093779:EUR', 'LU1116896363:EUR']
df = pd.DataFrame(List, columns=['List'])
urls = 'https://markets.ft.com/data/funds/tearsheet/summary?s='  df['List']

results = pd.DataFrame()
for url in urls:
    print(url)
    r = requests.get(url).content
    soup = BeautifulSoup(r, 'html.parser')
    table = soup.find('table')
    df = pd.read_html(str(table), index_col=0)[0].T
    results = results.append(df, sort=False)
    
results = results.reset_index(drop=True)
print (results)

Output:

print(results.to_string())
0                      Fund type Income treatment       Morningstar category IMA sector  Launch date Price currency    Domicile          ISIN                                                        Manager & start date                            Investment style (bonds)                 Investment style (stocks)
0                          SICAV           Income   Global Bond - EUR Hedged         --  06 Aug 2010            GBP  Luxembourg  LU0526609390  Jonathan Gregory01 Nov 2012Vivek Acharya09 Dec 2015Simon Foster01 Nov 2012                                                 NaN                                       NaN
1  Open Ended Investment Company           Income       EUR Diversified Bond         --  21 Feb 2014            EUR     Ireland  IE00BHBX0Z19                         Lorenzo Pagani12 May 2017Konstantin Veit01 Jul 2019  Credit Quality: HighInterest-Rate Sensitivity: Mod                                       NaN
2                          SICAV           Income  Eurozone Large-Cap Equity         --  11 Jul 2014            GBP  Luxembourg  LU1076093779                                                                         NaN                                                 NaN  Market Cap: LargeInvestment Style: Blend
3                          SICAV           Income          EUR Flexible Bond         --  01 Dec 2014            EUR  Luxembourg  LU1116896363                                                                         NaN                                                 NaN                                       NaN

Here's how you could use BeautifulSoup to fix those colspan attributes.

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Define all urls required for data scrapping from the FT Website - if new fund is added simply add the appropriate Fund ID to the List
List = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR', 'LU1076093779:EUR', 'LU1116896363:EUR']
df = pd.DataFrame(List, columns=['List'])
urls = 'https://markets.ft.com/data/funds/tearsheet/summary?s='  df['List']


for url in urls:
    print(url)
    r = requests.get(url).content
    soup = BeautifulSoup(r, 'html.parser')
    
    all_colspan = soup.find_all(attrs={'colspan':True})
    for colspan in all_colspan:
        colspan.attrs['colspan'] = colspan.attrs['colspan'].replace('%', '')
        
    df = pd.read_html(str(soup))