Home > Blockchain >  Separating column string values with varying delimiters
Separating column string values with varying delimiters

Time:09-12

I have a column in a dataframe that I want to split into two columns. The values in the column are strings with a players' name followed by their position. Because players have different numbers of names, this becomes a bigger issue.

For example:

  • 1 name: Jorginho Defensive Midfield
  • 2 names: Heung-min Son Left Winger
  • 3 names: Bilal El Khannouss Attacking Midfield

The desired output would be:

Player              Position
Jorginho            Defensive Midfield
Heung-min Son       Left Winger
Bilal El Khannouss  Attacking Midfield

I believe this can be done by listing the player positions, however I don't know how to approach that problem. I tried separating using split() with a space character as the delimiter, but that doesn't work unfortunately.

import pandas as pd
df = pd.DataFrame({'Player': ['Richarlison Centre-Forward',
                              'Heung-min Son Left Winger',
                              'Harry Wilson Right Winger',
                              'Bilal El Khannouss Attacking Midfield',
                              'Eduardo Camavinga Central Midfield',
                              'Jorginho Defensive Midfield',
                              'Lewis Patterson Centre-Back',
                              'Layvin Kurzawa Left-Back',
                              'Kyle Walker Right-Back',
                              'Jordan Pickford Goalkeeper']})

positions = ['Centre-Forward', 'Left Winger', 'Right Winger',
             'Attacking Midfield', 'Central Midfield', 'Defensive Midfield',
             'Centre-Back', 'Left-Back', 'Right-Back', 'Goalkeeper']

Is this possible to do?

CodePudding user response:

You can craft a regex.

import re
regex = '|'.join(map(re.escape, positions))

df['Player'].str.extract(fr'(.*)\s*({regex})')

NB. changed 'Central Midfielder' to 'Central Midfield' in the list of positions.

Another approach that does not require any list, would be to extract the last 2 words (either separated by spaces, or a dash):

df['Player'].str.extract(r'(.*)\s(\w (?:-|\s )\w )')

output:

                     0                   1
0         Richarlison       Centre-Forward
1       Heung-min Son          Left Winger
2        Harry Wilson         Right Winger
3  Bilal El Khannouss   Attacking Midfield
4   Eduardo Camavinga     Central Midfield
5            Jorginho   Defensive Midfield
6     Lewis Patterson          Centre-Back
7      Layvin Kurzawa            Left-Back
8         Kyle Walker           Right-Back
9     Jordan Pickford           Goalkeeper

CodePudding user response:

You can use re.sub() function to delete the positions from the player list and then create another list of player. Here the positions are indexed with player so I don't use for loop for matching the position.

import re
player = [item for item in range(len(positions))]
for i in range(len(positions)):
    player[i]=re.sub(positions[i],'',str(df['Player'][i]))
d_frame = {'Player':player,'Position':positions}
df = pd.DataFrame(d_frame)

You can use it to make the new dataframe

  • Related