Separating column string values with varying delimiters

I have a column in a dataframe that I want to split into two columns. Each value is a string containing a player's name followed by their position. The difficulty is that names contain a varying number of words, so the split point is not at a fixed word count.

For example:

1 name: Jorginho Defensive Midfield
2 names: Heung-min Son Left Winger
3 names: Bilal El Khannouss Attacking Midfield

The desired output would be:

Player              Position
Jorginho            Defensive Midfield
Heung-min Son       Left Winger
Bilal El Khannouss  Attacking Midfield

I believe this can be done by listing the player positions, but I don't know how to approach that. I tried split() with a space as the delimiter, but that doesn't work because both names and positions can contain spaces.

import pandas as pd
df = pd.DataFrame({'Player': ['Richarlison Centre-Forward',
                              'Heung-min Son Left Winger',
                              'Harry Wilson Right Winger',
                              'Bilal El Khannouss Attacking Midfield',
                              'Eduardo Camavinga Central Midfield',
                              'Jorginho Defensive Midfield',
                              'Lewis Patterson Centre-Back',
                              'Layvin Kurzawa Left-Back',
                              'Kyle Walker Right-Back',
                              'Jordan Pickford Goalkeeper']})

positions = ['Centre-Forward', 'Left Winger', 'Right Winger',
             'Attacking Midfield', 'Central Midfield', 'Defensive Midfield',
             'Centre-Back', 'Left-Back', 'Right-Back', 'Goalkeeper']

Is this possible to do?

CodePudding user response：

You can build a regex from the list of positions:

import re
regex = '|'.join(map(re.escape, positions))

# \s+ (rather than \s*) keeps the separating space out of the captured name
df['Player'].str.extract(fr'(.*)\s+({regex})')

NB. changed 'Central Midfielder' to 'Central Midfield' in the list of positions.
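As a self-contained illustration of the list-based pattern, here is a minimal sketch on a subset of the sample data. The named groups, the `^`/`$` anchors, and the variable names `alternation`, `pattern`, and `result` are additions for this example, not part of the answer above; named groups are used so that `str.extract` labels the output columns directly.

```python
import re
import pandas as pd

df = pd.DataFrame({'Player': ['Richarlison Centre-Forward',
                              'Heung-min Son Left Winger',
                              'Bilal El Khannouss Attacking Midfield',
                              'Jordan Pickford Goalkeeper']})
positions = ['Centre-Forward', 'Left Winger', 'Attacking Midfield', 'Goalkeeper']

# Escape each position so characters such as '-' are matched literally
alternation = '|'.join(map(re.escape, positions))

# Assumption: every value ends with one of the listed positions
pattern = rf'^(?P<Player>.*?)\s+(?P<Position>{alternation})$'

# The group names become the column names of the result
result = df['Player'].str.extract(pattern)
print(result)
```

Rows whose text does not end in a listed position would come back as NaN in both columns, which makes gaps in the list easy to spot.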

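Another approach, which does not require a list, is to extract the last two words (separated by either whitespace or a dash). It assumes every position is exactly two words, so a one-word position such as 'Goalkeeper' would pull the player's surname into the position:

df['Player'].str.extract(r'(.*)\s(\w+(?:-|\s+)\w+)')

Output of the list-based pattern: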

                     0                   1
0         Richarlison       Centre-Forward
1       Heung-min Son          Left Winger
2        Harry Wilson         Right Winger
3  Bilal El Khannouss   Attacking Midfield
4   Eduardo Camavinga     Central Midfield
5            Jorginho   Defensive Midfield
6     Lewis Patterson          Centre-Back
7      Layvin Kurzawa            Left-Back
8         Kyle Walker           Right-Back
9     Jordan Pickford           Goalkeeper
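To check how the list-free "last two words" pattern behaves on different position shapes, here is a small sketch on three of the sample strings. The series name `names` and the result name `parts` are made up for this example.

```python
import pandas as pd

names = pd.Series(['Heung-min Son Left Winger',
                   'Kyle Walker Right-Back',
                   'Jordan Pickford Goalkeeper'])

# Capture the last two words, joined by a dash or by whitespace
parts = names.str.extract(r'(.*)\s(\w+(?:-|\s+)\w+)')

for name, position in parts.itertuples(index=False):
    print(repr(name), '|', repr(position))
```

The first two rows split as intended. For the third, the pattern takes 'Pickford Goalkeeper' as the position, which is why the list-based pattern is the safer choice when one-word positions occur.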

CodePudding user response：

You can use re.sub() to delete the position from each player string and collect the results in a new list. In the example, the positions list happens to be in the same order as the rows, so positions[i] belongs to row i and no search for the matching position is needed.

import re
# Relies on positions[i] being the position of the player in row i,
# which holds for the sample data but not for a real dataset.
player = [re.sub(re.escape(pos), '', name).strip()
          for name, pos in zip(df['Player'], positions)]
df = pd.DataFrame({'Player': player, 'Position': positions})

The two lists are then used to build the new dataframe.
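The loop above depends on the positions list lining up with the rows. A possible variant that looks up the position for each row instead is sketched below, without regex. `split_player` is a hypothetical helper written for this example, and it assumes each string ends with exactly one of the listed positions.

```python
import pandas as pd

df = pd.DataFrame({'Player': ['Jordan Pickford Goalkeeper',
                              'Heung-min Son Left Winger',
                              'Lewis Patterson Centre-Back',
                              'Jorginho Defensive Midfield']})
# Deliberately in a different order from the rows
positions = ['Left Winger', 'Defensive Midfield', 'Centre-Back', 'Goalkeeper']


def split_player(value, positions):
    """Return (name, position) for the first listed position that ends value."""
    for pos in positions:
        if value.endswith(pos):
            # Drop the position and the whitespace that separated it from the name
            return value[:-len(pos)].rstrip(), pos
    # No listed position found: keep the text and leave the position empty
    return value, None


out = pd.DataFrame([split_player(v, positions) for v in df['Player']],
                   columns=['Player', 'Position'])
print(out)
```

If one listed position were a suffix of another, the order of the list would decide which one matches, so sorting the list longest first would be a reasonable precaution in that case.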