Home > Net >  Python Generate new columns based on string condition
Python Generate new columns based on string condition

Time:05-09

I have the following DF:

| Fecha      | Partido                 | Equipo  |  xG  |  xGA |
|------------|-------------------------|---------|------|------|
| 2022-05-01 | América - Cruz Azul 0:0 | América | 1.53 | 0.45 |
| 2022-05-01 | Leon - América 2:0      | América | 1.70 | 0.35 |

I want to create three new columns based on the Partido column where the first team goes to a new column named Home, the second team to a column named Visitor and the score to a column named Score.

Desired Output:

| Fecha      | Partido                 | Equipo  |  xG  |  xGA | Home    | Away       | Score |
|------------|-------------------------|---------|------|------|-------- |------------|-------|
| 2022-05-01 | América - Cruz Azul 0:0 | América | 1.53 | 0.45 | América | Cruz Azul  | 0:0   |
| 2022-05-01 | Leon - América 2:0      | América | 1.70 | 0.35 | Leon    | América    | 2:0   |

I have tried splitting with delimiter but since some teams have two words in their names it doesn't work.

CodePudding user response:

It is quite simple using str.extract and a regex:

regex = r'([^-] )\s*-\s*([^-] ) (\d :\d )'
df[['Home', 'Away', 'Score']] = df['Partido'].str.extract(regex)

output:

        Fecha                  Partido   Equipo    xG   xGA      Home       Away Score
0  2022-05-01  América - Cruz Azul 0:0  América  1.53  0.45  América   Cruz Azul   0:0
1  2022-05-01       Leon - América 2:0  América  1.70  0.35     Leon     América   2:0

regex demo

If you don't want to modify the original DataFrame, you can also use named capturing groups to directly set the column names:

regex = r'(?P<Home>[^-] )\s*-\s*(?P<Away>[^-] ) (?P<Score>\d :\d )'
df2 = df['Partido'].str.extract(regex)

#        Home       Away Score
# 0  América   Cruz Azul   0:0
# 1     Leon     América   2:0

# OR
df2 = df.join(df['Partido'].str.extract(regex))

# same a first output

CodePudding user response:

You might use rsplit to get teams and score then split to get home and away following way

import pandas as pd
df = pd.DataFrame({'Partido':['América - Cruz Azul 0:0','Leon - América 2:0']})
df[['Teams', 'Score']] = df.Partido.str.rsplit(' ', 1, expand=True)
df[['Home', 'Away']] = df.pop('Teams').str.split(' - ', expand=True)
print(df)

output

                   Partido Score     Home       Away
0  América - Cruz Azul 0:0   0:0  América  Cruz Azul
1       Leon - América 2:0   2:0     Leon    América

Note: I simplified input df for brevity sake. 1 in rsplit mean split once, i.e. at rightmost space. .pop does remove and return selected column, I use it as Teams is only temporary.

CodePudding user response:

One option with a split (regex):


regex = r"(\d :\d )|\s -\s "
outcome = (df.Partido
             .str.split(regex, expand = True)
             .dropna(how ='all', axis = 1)
             .iloc[:, :-1]
           )

df[['Home', 'Away', 'Score']] = outcome

df

        Fecha                  Partido   Equipo    xG   xGA     Home        Away Score
0  2022-05-01  América - Cruz Azul 0:0  América  1.53  0.45  América  Cruz Azul    0:0
1  2022-05-01       Leon - América 2:0  América  1.70  0.35     Leon    América    2:0
  • Related