I have a Pandas Dataframe df
that details Names of players that play a game. The Dataframe has 2 columns of 'Date' they played a game and their name, sorted by Date.
Date | Name |
---|---|
1993-03-28 | Tom |
1993-03-28 | Joe |
1993-03-29 | Tom |
1993-03-30 | Joe |
What I am trying to accomplish is to time-efficiently calculate the previous number of games each player has played before they play the upcoming game that day.
For the example Dataframe above, calculating the players previous number of games would start at 0 and look like follows.
Date | Name | Previous Games |
---|---|---|
1993-03-28 | Tom | 0 |
1993-03-28 | Joe | 0 |
1993-03-29 | Tom | 1 |
1993-03-30 | Joe | 1 |
I have tried the following codes and although they have delivered the correct result, they took many days for my computer to run.
Attempt 1:
for i in range(0, len(df) ):
df['Previous Games'][i] = len( df[ (df['Name'] == df['Name'][i]) & (df['Date'] < df['Date'][i]) ] )
Attempt 2:
df['Previous Games'] = [ len( df[ (df['Name'] == df['Name'][i]) & (df['Date'] < df['Date'][i]) ] ) for i in range(0, len(df) ) ]
Although Attempt 2 was slightly quicker, it was still not time-efficient so I need help in finding a faster method.
CodePudding user response:
Any time you write "for" and "pandas" anywhere close together you are probably doing something wrong.
It seems to me you want the cumulative count:
df["prev_games"] = df.sort_values('Date').groupby('Name').cumcount()
CodePudding user response:
Yes, a quicker way should be to avoid explicit for loops. You could group the dataframe for each name, and then .rank
the rows by "Date":
>>> df["Previous Games"] = df.groupby("Name")["Date"].rank("dense") - 1
The -1 was added to start from 0.
CodePudding user response:
This is a pandas question rather than a python one.
There are several options to avoid a for cycle when dealing with pandas dataframes. The most immediate is the following one:
# To recreate a dummy dataset:
se = pd.date_range(start='2016-01-01', end='2020-12-31', freq='D')
df = pd.DataFrame({"Date": se, "Name": list(np.random.choice(("joe", "bob", "alice"), len(se)))})
# To add the previous games column
df['Previous Games'] = df.apply(lambda row: ((row["Date"] > df["Date"]) * (row["Name"] == df["Name"])).sum(), axis=1)