Is there an efficient way to calculate when a record was replaced by another?

I am going to use a soccer analogy to illustrate the problem. I have a table representing players in a soccer game.

player | position          | start minute
------------------------------
Bob    | keeper            | 0
Pedro  | Center Midfielder | 0
Joe    | Striker           | 0
Tim    | Center Midfielder | 20

I want to add an "end minute" column recording when each player was substituted. In the table above, Tim replaced Pedro at the "Center Midfielder" position at minute 20; we know this because Tim is the next player to start at that position after Pedro. A player who is never replaced plays until the end of the game, so their "end minute" is 90. The difference between "end minute" and "start minute" is each player's "play duration".

I hope this is clear. I am unable to find a clean way to do this in pandas. In the above example there was only one substitution so you can "brute force" the problem. In principle, I need code that works for an unlimited number of substitutions and this is where I get stuck.

Answer:

One approach could be as follows:

Data

import pandas as pd

# adding some subs to get a more informative example
data = {'player': ['Bob', 'Pedro', 'Joe', 'Tim', 'Keith', 'Leo'],
        'position': ['keeper', 'Center Midfielder', 'Striker',
                     'Center Midfielder', 'Center Midfielder', 'Striker'],
        'start minute': [0, 0, 0, 20, 85, 70]}
df = pd.DataFrame(data)

  player           position  start minute
0    Bob             keeper             0 # 90 mins, no sub
1  Pedro  Center Midfielder             0 # 20 mins, repl by Tim
2    Joe            Striker             0 # 70 mins, repl by Leo
3    Tim  Center Midfielder            20 # 65 mins, repl by Keith
4  Keith  Center Midfielder            85 # 5 mins, no sub
5    Leo            Striker            70 # 20 mins, no sub

Code

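# within each position, the next player's start minute is this player's end minute;
# the last player at a position has no successor (NaN) and plays until minute 90
df['end minute'] = df.groupby('position')['start minute'].shift(-1).fillna(90)
df['play duration'] = df['end minute'].sub(df['start minute'])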

print(df)

  player           position  start minute  end minute  play duration
0    Bob             keeper             0        90.0           90.0
1  Pedro  Center Midfielder             0        20.0           20.0
2    Joe            Striker             0        70.0           70.0
3    Tim  Center Midfielder            20        85.0           65.0
4  Keith  Center Midfielder            85        90.0            5.0
5    Leo            Striker            70        90.0           20.0

Explanation

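Group the rows by position with df.groupby and shift "start minute" by -1 periods within each group, so that each player's row receives the start minute of the next player at the same position.
The last player at each position has no successor, so their value is NaN. These are the players who were still on the pitch at the end of the game, so chain Series.fillna with the value 90.
Finally, add a "play duration" column by applying Series.sub to the "end minute" and "start minute" columns.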
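
To make the first two steps concrete, here is a minimal sketch on a four-row subset of the example data. It shows the intermediate result of the grouped shift before and after fillna. The variable names next_start and end_minute are only illustrative.

```python
import pandas as pd

# Four-row subset of the example: two positions, one substitution each
df = pd.DataFrame({
    "player": ["Pedro", "Joe", "Tim", "Leo"],
    "position": ["Center Midfielder", "Striker", "Center Midfielder", "Striker"],
    "start minute": [0, 0, 20, 70],
})

# Shift within each position: every row gets the start minute of its successor
next_start = df.groupby("position")["start minute"].shift(-1)
# Pedro -> 20.0 (Tim), Joe -> 70.0 (Leo), Tim -> NaN, Leo -> NaN

# Players without a successor stay on until the final whistle
end_minute = next_start.fillna(90)
print(pd.concat([df, next_start.rename("next start"),
                 end_minute.rename("end minute")], axis=1))
```

The NaN rows are exactly the players who finish the game, which is why a single fillna is enough no matter how many substitutions each position has.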

N.B. The above assumes that the players are listed in chronological order per position. If you're not sure about this, first use:

df.sort_values(by=['position','start minute'], inplace=True)

and then, at the end, restore the original row order (sort_values keeps each row's index label, so sorting by index undoes the reordering):

df.sort_index(inplace=True)
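
As a possible variation, the sort and the grouped shift can be wrapped in a helper that leaves the input untouched. This is only a sketch: the function name add_end_minutes and the match_length parameter are hypothetical, and it assumes that no two players start at the same position in the same minute. Because pandas aligns on the index when assigning a Series to a column, no explicit sort back is needed here.

```python
import pandas as pd

def add_end_minutes(df, match_length=90):
    """Return a copy of df with 'end minute' and 'play duration' columns."""
    # Sort a copy chronologically; the original index labels are preserved
    ordered = df.sort_values("start minute", kind="stable")
    # Within each position, the successor's start minute is the end minute
    end = ordered.groupby("position")["start minute"].shift(-1)
    out = df.copy()
    # Assignment aligns on the index, so rows keep their original order
    out["end minute"] = end.fillna(match_length).astype(int)
    out["play duration"] = out["end minute"] - out["start minute"]
    return out

# Same players as in the example, deliberately out of chronological order
shuffled = pd.DataFrame({
    "player": ["Keith", "Bob", "Tim", "Leo", "Pedro", "Joe"],
    "position": ["Center Midfielder", "keeper", "Center Midfielder",
                 "Striker", "Center Midfielder", "Striker"],
    "start minute": [85, 0, 20, 70, 0, 0],
})
result = add_end_minutes(shuffled)
print(result)
```

Casting to int after fillna is optional; it only avoids the float columns that the NaN placeholders would otherwise force.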