I am going to use a soccer analogy to illustrate the problem. I have a table representing players in a soccer game.
player | position | start minute
------------------------------
Bob | keeper | 0
Pedro | Center Midfielder | 0
Joe | Striker | 0
Tim | Center Midfielder | 20
I want to add a column "end minute" recording when each player was substituted. In the table above, Tim replaced Pedro at the "Center Midfielder" position at minute 20; you can tell because Tim's start minute at that position comes after Pedro's. If nobody replaces a player, they play until the end of the game, so "end minute" = 90. The difference between "end minute" and "start minute" is the "play duration" for each player.
I hope this is clear. I am unable to find a clean way to do this in pandas. In the above example there was only one substitution so you can "brute force" the problem. In principle, I need code that works for an unlimited number of substitutions and this is where I get stuck.
CodePudding user response:
One approach could be as follows:
Data
import pandas as pd
# adding some subs to get a more informative example
data = {'player': {0: 'Bob', 1: 'Pedro', 2: 'Joe', 3: 'Tim', 4: 'Keith',
                   5: 'Leo'},
        'position': {0: 'keeper', 1: 'Center Midfielder', 2: 'Striker',
                     3: 'Center Midfielder', 4: 'Center Midfielder',
                     5: 'Striker'},
        'start minute': {0: 0, 1: 0, 2: 0, 3: 20, 4: 85, 5: 70}}
df = pd.DataFrame(data)
   player           position  start minute
0     Bob             keeper             0  # 90 mins, no sub
1   Pedro  Center Midfielder             0  # 20 mins, repl by Tim
2     Joe            Striker             0  # 70 mins, repl by Leo
3     Tim  Center Midfielder            20  # 65 mins, repl by Keith
4   Keith  Center Midfielder            85  # 5 mins, no sub
5     Leo            Striker            70  # 20 mins, no sub
Code
df['end minute'] = df.groupby('position').shift(-1)['start minute'].fillna(90)
df['play duration'] = df['end minute'].sub(df['start minute'])
print(df)
   player           position  start minute  end minute  play duration
0     Bob             keeper             0        90.0           90.0
1   Pedro  Center Midfielder             0        20.0           20.0
2     Joe            Striker             0        70.0           70.0
3     Tim  Center Midfielder            20        85.0           65.0
4   Keith  Center Midfielder            85        90.0            5.0
5     Leo            Striker            70        90.0           20.0
Explanation
- Use df.groupby on column position and shift by -1 periods.
- Remaining NaN values will be for players who were still on the pitch at the end of the game, so let's chain Series.fillna with value 90.
- Finally, add a column play duration, with Series.sub applied to the end and start columns.
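To make the intermediate step visible, here is a minimal sketch that splits the one-liner into its parts, using the same six players. The variable names (next_start, still_on_pitch, end_minute) are only illustrative, and the final cast to int is an optional extra that assumes whole-minute values.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Bob", "Pedro", "Joe", "Tim", "Keith", "Leo"],
    "position": ["keeper", "Center Midfielder", "Striker",
                 "Center Midfielder", "Center Midfielder", "Striker"],
    "start minute": [0, 0, 0, 20, 85, 70],
})

# Within each position, look one row ahead: the next player's start
# minute is the current player's end minute.
next_start = df.groupby("position")["start minute"].shift(-1)

# The last player listed for each position has no successor, so the
# shifted value is NaN for that row.
still_on_pitch = next_start.isna()

# Fill those gaps with the full-time minute and cast back to integers
# (the NaN values forced the shifted column to float).
end_minute = next_start.fillna(90).astype(int)

print(pd.DataFrame({"next_start": next_start,
                    "still_on_pitch": still_on_pitch,
                    "end_minute": end_minute}))
```

Bob, Keith and Leo are the rows flagged as still on the pitch, which matches the "no sub" annotations in the data above.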
N.B. The above assumes that the players are listed in chronological order per position. If you're not sure about this, first use:
df.sort_values(by=['position','start minute'], inplace=True)
and then at the end use the following to get the original index back:
df.sort_index(inplace=True)
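If you prefer not to modify the frame in place, the sort, shift and restore steps can be wrapped in one function. This is only a sketch: add_play_duration and its match_length parameter are hypothetical names, and it assumes start minutes are whole numbers and that the original index is unique so sort_index can restore the row order.

```python
import pandas as pd


def add_play_duration(df, match_length=90):
    """Return a copy of df with 'end minute' and 'play duration' added."""
    # Sort a copy chronologically within each position; the caller's
    # frame is left untouched.
    out = df.sort_values(["position", "start minute"]).copy()

    # The next starter at the same position marks the end of this stint;
    # the last player per position stays on until full time.
    out["end minute"] = (
        out.groupby("position")["start minute"]
        .shift(-1)
        .fillna(match_length)
        .astype(int)
    )
    out["play duration"] = out["end minute"] - out["start minute"]

    # Restore the original row order.
    return out.sort_index()


# Rows deliberately out of chronological order.
shuffled = pd.DataFrame({
    "player": ["Keith", "Bob", "Tim", "Pedro"],
    "position": ["Center Midfielder", "keeper",
                 "Center Midfielder", "Center Midfielder"],
    "start minute": [85, 0, 20, 0],
})

result = add_play_duration(shuffled)
print(result)
```

Returning a new frame rather than using inplace=True keeps the input reusable, at the cost of one extra copy.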