Need to split a column but only removing the character

Time:12-28

Good morning. Below are the first rows of my df, followed by my code.

When I try to split on '<' to strip the strong tag from the link, str.split('<') only removes the character and leaves a list in each cell, and adding [0] after the split raises a KeyError.

Any ideas how to get this to work?

First desired link:

http://africa.espn.com/college-sports/football/recruiting/player/_/id/222687/kayvon-thibodeaux

    0
0   <a  href="http://africa.espn.com/college-sports/football/recruiting/rankings">Back to Ranking Index</a>
1   <a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/222687/kayvon-thibodeaux" name=""></a>
2   <a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/222687/kayvon-thibodeaux"><strong>Kayvon Thibodeaux</strong></a>
3   <a href="http://insider.espn.com/college-sports/football/recruiting/player/evaluation/_/id/222687/kayvon-thibodeaux">Scouts Report</a>
4   <a href="http://africa.espn.com/college-sports/football/recruiting/playerrankings/_/view/rn300/sort/rank/class/2019"><img border="0"  src="https://a.espncdn.com/i/recruiting/logos/2012/sml/rn-300_sml.png" title="ESPN 300"/></a>
5   <a href="http://africa.espn.com/college-sports/football/recruiting/school/_/id/2483/class/2019/oregon-ducks"><img  src="https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/2483.png?w=110&amp;h=110&amp;transparent=true" style="width: 50px"/></a>
6   <a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/226752/nolan-smith" name=""></a>
7   <a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/226752/nolan-smith"><strong>Nolan Smith</strong></a>
8   <a href="http://insider.espn.com/college-sports/football/recruiting/player/evaluation/_/id/226752/nolan-smith">Scouts Report</a>
9   <a href="http://africa.espn.com/college-sports/football/recruiting/playerrankings/_/view/rn300/sort/rank/class/2019"><img border="0"  src="https://a.espncdn.com/i/recruiting/logos/2012/sml/rn-300_sml.png" title="ESPN 300"/></a>
10  <a href="http://africa.espn.com/college-sports/football/recruiting/school/_/id/61/class/2019/georgia-bulldogs"><img  src="https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/61.png?w=110&amp;h=110&amp;transparent=true" style="width: 50px"/></a>
11  <a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/216987/kenyon-green" name=""></a>
12  <a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/216987/kenyon-green"><strong>Kenyon Green</strong></a>
13  <a href="http://insider.espn.com/college-sports/football/recruiting/player/evaluation/_/id/216987/kenyon-green">Scouts Report</a>
14  <a href="http://africa.espn.com/college-sports/football/recruiting/playerrankings/_/view/rn300/sort/rank/class/2019"><img border="0"  src="https://a.espncdn.com/i/recruiting/logos/2012/sml/rn-300_sml.png" title="ESPN 300"/></a>
15  <a href="http://africa.espn.com/college-sports/football/recruiting/school/_/id/245/class/2019/texas-aggies"><img  src="https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/245.png?w=110&amp;h=110&amp;transparent=true" style="width: 50px"/></a>
16  <a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/222156/evan-neal" name=""></a>
17  <a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/222156/evan-neal"><strong>Evan Neal</strong></a>
18  <a href="http://insider.espn.com/college-sports/football/recruiting/player/evaluation/_/id/222156/evan-neal">Scouts Report</a>
19  <a href="http://africa.espn.com/college-sports/football/recruiting/playerrankings/_/view/rn300/sort/rank/class/2019"><img border="0"  src="https://a.espncdn.com/i/recruiting/logos/2012/sml/rn-300_sml.png" title="ESPN 300"/></a>
20  <a href="http://africa.espn.com/college-sports/football/recruiting/school/_/id/333/class/2019/alabama-crimson-tide"><img  src="https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/333.png?w=110&amp;h=110&amp;transparent=true" style="width: 50px"/></a>



#players.to_excel('Player_Links.xlsx')
players = pd.read_excel('Player_Links.xlsx')
players['Links'] = players.iloc[:,1]
players = players[players['Links'].str.contains('strong')]
players['Links'] = players['Links'].str.replace('<a href="','')
players['Links'] = players['Links'].str.split('<')
print(players)

CodePudding user response:

Filter your dataframe to get the rows with the <strong> tags. Then just use BeautifulSoup to parse the HTML, applying it to each cell with a lambda function:

from bs4 import BeautifulSoup
import pandas as pd


df = pd.DataFrame( [   
['<a  href="http://africa.espn.com/college-sports/football/recruiting/rankings">Back to Ranking Index</a>'],
['<a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/222687/kayvon-thibodeaux" name=""></a>'],
['<a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/222687/kayvon-thibodeaux"><strong>Kayvon Thibodeaux</strong></a>'],
['<a href="http://insider.espn.com/college-sports/football/recruiting/player/evaluation/_/id/222687/kayvon-thibodeaux">Scouts Report</a>'],
['<a href="http://africa.espn.com/college-sports/football/recruiting/playerrankings/_/view/rn300/sort/rank/class/2019"><img border="0"  src="https://a.espncdn.com/i/recruiting/logos/2012/sml/rn-300_sml.png" title="ESPN 300"/></a>'],
['<a href="http://africa.espn.com/college-sports/football/recruiting/school/_/id/2483/class/2019/oregon-ducks"><img  src="https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/2483.png?w=110&amp;h=110&amp;transparent=true" style="width: 50px"/></a>'],
['<a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/226752/nolan-smith" name=""></a>']],
    columns=[0])

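# .copy() avoids a SettingWithCopyWarning on the assignment below
df_filter = df[df[0].str.contains('<strong>')].copy()

# parse each cell as HTML and pull the href attribute of its <a> tag
df_filter[0] = df_filter[0].apply(lambda html: BeautifulSoup(html, 'html.parser').find('a')['href'])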

Output from the sample set used above:

print(df_filter.to_string())
                                                                                                0
2  http://africa.espn.com/college-sports/football/recruiting/player/_/id/222687/kayvon-thibodeaux
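
As for the KeyError in the question: putting [0] directly after str.split('<') looks up the row labelled 0 in the resulting Series instead of taking the first piece of each list, and that row was already dropped by the contains('strong') filter. Below is a minimal sketch of a split-only fix, assuming a small made-up frame in place of the Excel file (the two sample rows are copied from the question). It uses the .str[0] accessor, which works element by element, and splits on the closing quote rather than on '<' so that no trailing "> is left on the URL.

```python
import pandas as pd

# Hypothetical stand-in for the data read from Player_Links.xlsx.
players = pd.DataFrame({'Links': [
    '<a href="http://africa.espn.com/college-sports/football/recruiting/rankings">Back to Ranking Index</a>',
    '<a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/222687/kayvon-thibodeaux"><strong>Kayvon Thibodeaux</strong></a>',
]})

# Keep only the rows with a <strong> tag; the surviving index label is 1, not 0.
players = players[players['Links'].str.contains('<strong>', regex=False)].copy()

# Drop the opening tag as a literal string, then cut at the closing quote.
links = players['Links'].str.replace('<a href="', '', regex=False)
# .str[0] picks the first piece of each row's list;
# a plain [0] would look up index label 0 instead.
players['Links'] = links.str.split('"').str[0]

print(players['Links'].tolist())
```

This only works while every kept row starts with exactly `<a href="`; the BeautifulSoup approach above is more tolerant of spacing and attribute order.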

CodePudding user response:

You could also do everything with regular expressions:

import pandas as pd

players = pd.read_excel('Player_Links.xlsx')
players['Links'] = players.iloc[:,1]
# capture the URL that sits right before a <strong> tag
regex = r"(http:.*)\">.*<strong>"
players = players.Links.str.findall(regex)
# only keep the rows for which the regex hit
players = players[players.str.len() == 1]
# flatten the list
players = players.str[0]
print(players)
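
A possible shorter variant of the same idea uses str.extract, which returns the capture group directly and gives NaN for rows that do not match, so the filter and flatten steps collapse into one dropna. This is only a sketch: the sample Series and the pattern below are my own assumptions, built from rows in the question rather than from the Excel file.

```python
import pandas as pd

# Hypothetical sample rows standing in for the Links column.
links = pd.Series([
    '<a  href="http://africa.espn.com/college-sports/football/recruiting/rankings">Back to Ranking Index</a>',
    '<a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/226752/nolan-smith"><strong>Nolan Smith</strong></a>',
    '<a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/226752/nolan-smith" name=""></a>',
])

# One capture group: the href value that is followed directly by a <strong> tag.
pattern = r'href="([^"]+)"><strong>'

# expand=False returns a Series; rows without a match become NaN and are dropped.
urls = links.str.extract(pattern, expand=False).dropna()

print(urls.tolist())
```

The pattern assumes the <strong> tag follows the opening <a> tag immediately, as it does in every sample row shown in the question.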