Good morning. Below are the first 20 rows of my df and my code.
When I try to split on '<' to remove the strong tag from the link, split only removes the character, and split('<')[0] returns a KeyError.
Any ideas how to get this to work?
First desired link:
http://africa.espn.com/college-sports/football/recruiting/player/_/id/222687/kayvon-thibodeaux
0
0 <a href="http://africa.espn.com/college-sports/football/recruiting/rankings">Back to Ranking Index</a>
1 <a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/222687/kayvon-thibodeaux" name=""></a>
2 <a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/222687/kayvon-thibodeaux"><strong>Kayvon Thibodeaux</strong></a>
3 <a href="http://insider.espn.com/college-sports/football/recruiting/player/evaluation/_/id/222687/kayvon-thibodeaux">Scouts Report</a>
4 <a href="http://africa.espn.com/college-sports/football/recruiting/playerrankings/_/view/rn300/sort/rank/class/2019"><img border="0" src="https://a.espncdn.com/i/recruiting/logos/2012/sml/rn-300_sml.png" title="ESPN 300"/></a>
5 <a href="http://africa.espn.com/college-sports/football/recruiting/school/_/id/2483/class/2019/oregon-ducks"><img src="https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/2483.png?w=110&h=110&transparent=true" style="width: 50px"/></a>
6 <a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/226752/nolan-smith" name=""></a>
7 <a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/226752/nolan-smith"><strong>Nolan Smith</strong></a>
8 <a href="http://insider.espn.com/college-sports/football/recruiting/player/evaluation/_/id/226752/nolan-smith">Scouts Report</a>
9 <a href="http://africa.espn.com/college-sports/football/recruiting/playerrankings/_/view/rn300/sort/rank/class/2019"><img border="0" src="https://a.espncdn.com/i/recruiting/logos/2012/sml/rn-300_sml.png" title="ESPN 300"/></a>
10 <a href="http://africa.espn.com/college-sports/football/recruiting/school/_/id/61/class/2019/georgia-bulldogs"><img src="https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/61.png?w=110&h=110&transparent=true" style="width: 50px"/></a>
11 <a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/216987/kenyon-green" name=""></a>
12 <a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/216987/kenyon-green"><strong>Kenyon Green</strong></a>
13 <a href="http://insider.espn.com/college-sports/football/recruiting/player/evaluation/_/id/216987/kenyon-green">Scouts Report</a>
14 <a href="http://africa.espn.com/college-sports/football/recruiting/playerrankings/_/view/rn300/sort/rank/class/2019"><img border="0" src="https://a.espncdn.com/i/recruiting/logos/2012/sml/rn-300_sml.png" title="ESPN 300"/></a>
15 <a href="http://africa.espn.com/college-sports/football/recruiting/school/_/id/245/class/2019/texas-aggies"><img src="https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/245.png?w=110&h=110&transparent=true" style="width: 50px"/></a>
16 <a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/222156/evan-neal" name=""></a>
17 <a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/222156/evan-neal"><strong>Evan Neal</strong></a>
18 <a href="http://insider.espn.com/college-sports/football/recruiting/player/evaluation/_/id/222156/evan-neal">Scouts Report</a>
19 <a href="http://africa.espn.com/college-sports/football/recruiting/playerrankings/_/view/rn300/sort/rank/class/2019"><img border="0" src="https://a.espncdn.com/i/recruiting/logos/2012/sml/rn-300_sml.png" title="ESPN 300"/></a>
20 <a href="http://africa.espn.com/college-sports/football/recruiting/school/_/id/333/class/2019/alabama-crimson-tide"><img src="https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/333.png?w=110&h=110&transparent=true" style="width: 50px"/></a>
import pandas as pd
#players.to_excel('Player_Links.xlsx')
players = pd.read_excel('Player_Links.xlsx')
players['Links'] = players.iloc[:,1]
players = players[players['Links'].str.contains('strong')]
players['Links'] = players['Links'].str.replace('<a href="','')
players['Links'] = players['Links'].str.split('<')
print(players)
CodePudding user response:
Filter your dataframe to get the rows with the <strong> tags. Then just use BeautifulSoup to parse the HTML. Use it in a lambda function:
from bs4 import BeautifulSoup
import pandas as pd
df = pd.DataFrame( [
['<a href="http://africa.espn.com/college-sports/football/recruiting/rankings">Back to Ranking Index</a>'],
['<a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/222687/kayvon-thibodeaux" name=""></a>'],
['<a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/222687/kayvon-thibodeaux"><strong>Kayvon Thibodeaux</strong></a>'],
['<a href="http://insider.espn.com/college-sports/football/recruiting/player/evaluation/_/id/222687/kayvon-thibodeaux">Scouts Report</a>'],
['<a href="http://africa.espn.com/college-sports/football/recruiting/playerrankings/_/view/rn300/sort/rank/class/2019"><img border="0" src="https://a.espncdn.com/i/recruiting/logos/2012/sml/rn-300_sml.png" title="ESPN 300"/></a>'],
['<a href="http://africa.espn.com/college-sports/football/recruiting/school/_/id/2483/class/2019/oregon-ducks"><img src="https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/2483.png?w=110&h=110&transparent=true" style="width: 50px"/></a>'],
['<a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/226752/nolan-smith" name=""></a>']],
columns=[0])
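# keep only the rows containing a <strong> tag; .copy() avoids a SettingWithCopyWarning on the assignment below
df_filter = df[df[0].str.contains('<strong>')].copy()
# parse each cell and pull the href attribute from its <a> tag
df_filter[0] = df_filter[0].apply(lambda cell: BeautifulSoup(cell, 'html.parser').find('a')['href'])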
Output:
For the sample set used above, this leaves a single row:
print(df_filter.to_string())
0
2 http://africa.espn.com/college-sports/football/recruiting/player/_/id/222687/kayvon-thibodeaux
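One caveat with the lambda: if a cell has no <a> tag, find('a') returns None and the ['href'] lookup raises a TypeError. A minimal sketch of a more defensive variant follows. The helper name player_href is hypothetical, and it checks for a <strong> child in the parsed tree rather than with a substring match, so the separate filtering step is folded in:

```python
from bs4 import BeautifulSoup
import pandas as pd


def player_href(cell):
    """Return the href of the first <a> if it wraps a <strong>, else None."""
    anchor = BeautifulSoup(cell, 'html.parser').find('a')
    # Skip cells with no anchor, or anchors without a <strong> descendant.
    if anchor is None or anchor.find('strong') is None:
        return None
    return anchor.get('href')


# Two rows copied from the sample data above.
df = pd.DataFrame({0: [
    '<a href="http://africa.espn.com/college-sports/football/recruiting/rankings">Back to Ranking Index</a>',
    '<a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/222687/kayvon-thibodeaux"><strong>Kayvon Thibodeaux</strong></a>',
]})

# Rows that yield None are dropped, so only player links remain.
links = df[0].apply(player_href).dropna()
print(links.to_string())
```

The original row labels are kept, so the result can still be joined back to the full dataframe if needed.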
CodePudding user response:
You could also do everything with regular expressions:
import pandas as pd
players = pd.read_excel('Player_Links.xlsx')
players['Links'] = players.iloc[:,1]
regex = r"(http:.*)\">.*<strong>"
players = players.Links.str.findall(regex)
# only keep the rows for which the regex hit
players = players[players.apply(lambda li: len(li) == 1)]
# flatten the list
players = players.apply(lambda li: li[0])
print(players)
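As for the KeyError in the question: .str.split('<') returns a Series holding one list per row, so a following [0] indexes the Series itself and looks up the row labelled 0. The 'strong' filter has already removed that row, hence the error. Element-wise access goes through the .str accessor instead. The sketch below uses two rows from the question's sample data as a stand-in for the Excel file (an assumption, so that it runs on its own) and shows both the repaired split approach and a shorter str.extract variant:

```python
import pandas as pd

# Stand-in for the Excel data: two rows copied from the question.
players = pd.DataFrame({'Links': [
    '<a href="http://africa.espn.com/college-sports/football/recruiting/rankings">Back to Ranking Index</a>',
    '<a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/222687/kayvon-thibodeaux"><strong>Kayvon Thibodeaux</strong></a>',
]})

# Keep only rows with a <strong> tag; the surviving row has label 1, not 0.
players = players[players['Links'].str.contains('<strong>', regex=False)].copy()

# Repaired split approach: .str[0] takes element 0 of each row's list,
# whereas a plain [0] would look for a row labelled 0 in the Series.
by_split = (players['Links']
            .str.replace('<a href="', '', regex=False)
            .str.split('<').str[0]
            .str.rstrip('">'))  # drop the trailing '">' left before <strong>

# Alternative: capture everything between href=" and the next quote.
by_regex = players['Links'].str.extract(r'href="([^"]+)"', expand=False)

print(by_split.to_string())
print(by_regex.to_string())
```

Both variants return a Series that keeps the original row labels, so either one can be assigned straight back to players['Links'].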