I have a column in a dataframe that I want to split into two columns. Each value is a string containing a player's name followed by their position. The difficulty is that players have different numbers of names, so the split point varies from row to row.
For example:
- 1 name:
Jorginho Defensive Midfield
- 2 names:
Heung-min Son Left Winger
- 3 names:
Bilal El Khannouss Attacking Midfield
The desired output would be:
Player              Position
Jorginho            Defensive Midfield
Heung-min Son       Left Winger
Bilal El Khannouss  Attacking Midfield
I believe this can be done by listing the player positions, but I don't know how to approach it. I tried separating with split() using a space character as the delimiter, but that doesn't work because both the names and the positions can contain spaces.
import pandas as pd
df = pd.DataFrame({'Player': ['Richarlison Centre-Forward',
'Heung-min Son Left Winger',
'Harry Wilson Right Winger',
'Bilal El Khannouss Attacking Midfield',
'Eduardo Camavinga Central Midfield',
'Jorginho Defensive Midfield',
'Lewis Patterson Centre-Back',
'Layvin Kurzawa Left-Back',
'Kyle Walker Right-Back',
'Jordan Pickford Goalkeeper']})
positions = ['Centre-Forward', 'Left Winger', 'Right Winger',
'Attacking Midfield', 'Central Midfield', 'Defensive Midfield',
'Centre-Back', 'Left-Back', 'Right-Back', 'Goalkeeper']
Is this possible to do?
CodePudding user response:
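You can craft a regex from the list of positions: escape each position, join them with |, and capture everything before the matching position as the name.
import re
regex = '|'.join(map(re.escape, positions))
# \s+ (rather than \s*) keeps the separating space out of the first group
df['Player'].str.extract(fr'(.*)\s+({regex})')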
NB. I changed 'Central Midfielder' to 'Central Midfield' in the list of positions.
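As a minimal sketch of how the same idea could produce the two named columns directly, the variant below uses named groups, which str.extract turns into column names. The lazy first group and the ^...$ anchors are my own additions, not part of the answer above; they are there so that the longest listed position at the end of the string is preferred. The data is a shortened copy of the question's example.

```python
import re
import pandas as pd

df = pd.DataFrame({'Player': ['Heung-min Son Left Winger',
                              'Bilal El Khannouss Attacking Midfield',
                              'Jorginho Defensive Midfield',
                              'Jordan Pickford Goalkeeper']})
positions = ['Left Winger', 'Attacking Midfield',
             'Defensive Midfield', 'Goalkeeper']

# Escape each position so characters like '-' are matched literally.
alternation = '|'.join(map(re.escape, positions))

# Named groups become the column names of the extracted frame.
# The name group is lazy and the pattern is anchored at both ends,
# so the position must be the final part of the string.
pattern = fr'^(?P<Player>.*?)\s+(?P<Position>{alternation})$'
result = df['Player'].str.extract(pattern)
print(result)
```

Rows whose text does not end in one of the listed positions come back as NaN in both columns, which makes missing entries in the list easy to spot.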
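Another approach, which does not require any list, is to extract the last two words (separated either by spaces or by a dash). It assumes every position is exactly two words, so a single-word position such as 'Goalkeeper' would take the preceding word of the name with it:
df['Player'].str.extract(r'(.*)\s(\w+(?:-|\s+)\w+)')
Output of the list-based regex: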
                    0                   1
0         Richarlison      Centre-Forward
1       Heung-min Son         Left Winger
2        Harry Wilson        Right Winger
3  Bilal El Khannouss  Attacking Midfield
4   Eduardo Camavinga    Central Midfield
5            Jorginho  Defensive Midfield
6     Lewis Patterson         Centre-Back
7      Layvin Kurzawa           Left-Back
8         Kyle Walker          Right-Back
9     Jordan Pickford          Goalkeeper
CodePudding user response:
You can use re.sub() to remove each position from the corresponding player string and collect the results in a separate list of player names. This relies on positions[i] being the position of the player in row i, as in the example data, so no search loop is needed to match a position to a player.
import re
# Assumes positions[i] belongs to the player in row i of df.
player = [re.sub(re.escape(pos) + '$', '', name).strip()
          for name, pos in zip(df['Player'], positions)]
# Build the new dataframe from the two aligned lists.
df = pd.DataFrame({'Player': player, 'Position': positions})
The two lists are then used to make the new dataframe.
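If the positions list is not in the same order as the rows, a suffix check per row is one possible alternative. The sketch below is only an illustration: the helper name split_name_position is hypothetical, and it assumes each value ends with exactly one of the listed positions.

```python
import pandas as pd

def split_name_position(value, positions):
    """Return (name, position); position is None if nothing matches."""
    # Try longer positions first so the most specific suffix wins.
    for pos in sorted(positions, key=len, reverse=True):
        if value.endswith(pos):
            return value[:-len(pos)].strip(), pos
    return value, None

df = pd.DataFrame({'Player': ['Kyle Walker Right-Back',
                              'Jorginho Defensive Midfield',
                              'Richarlison Centre-Forward']})
# Deliberately listed in a different order from the rows.
positions = ['Centre-Forward', 'Right-Back', 'Defensive Midfield']

pairs = [split_name_position(v, positions) for v in df['Player']]
result = pd.DataFrame(pairs, columns=['Player', 'Position'])
print(result)
```

Because it avoids regular expressions entirely, no escaping of the position strings is needed.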