I have two data frames and I want to use the date in one dataframe to calculate some fields from another data dataframe. Is there a better way to do this besides how I have done this below? In a toy example I have a dataframe of players and their injury datetime, and in another I have their games with then they played. I want to calculate if the player played a game before and after their injury and return a flag stating they did and the name of the game they played. Then I want to put this result back into the first dataframe.
It should look something like this:
df_A
id texts injury_date
0 1 text for person 1 2020-01-01
1 2 text for person 2 2020-02-01
2 1 text for person 1 AGAIN 2021-05-01
3 4 text for person 4 2022-01-01
df_B:
id games played_on
0 2 soccer 2019-01-01
1 1 football 2020-02-01
2 1 tennis 2021-05-01
and want to get:
id played after injury played before injury games played after games played before
0 1 True False [football, tennis] []
1 2 False True [] [soccer]
2 1 False True [] [football, tennis]
3 4 False False [] []
and merging it like this
texts id injury_date played after injury played before injury games played after games played before
0 text for person 1 1 2020-01-01 True False [football, tennis] []
1 text for person 1 AGAIN 1 2021-05-01 True False [] [football, tennis]
2 text for person 2 2 2020-02-01 False True [] [soccer]
3 text for person 4 4 2022-01-01 False False [] []
Code:
df_A = pd.DataFrame({"id": [1,2,1,4], "texts":['text for person 1', 'text for person 2', 'text for person 1 AGAIN', 'text for person 4'], "injury_date": [pd.to_datetime('1/1/2020'),pd.to_datetime('2/1/2020'),pd.to_datetime('5/1/2021'),pd.to_datetime('1/1/2022')]})
df_B = pd.DataFrame({"id": [2,1,1], "games":['soccer', 'football', 'tennis'], 'played_on': [pd.to_datetime('1/1/2019'),pd.to_datetime('2/1/2020'),pd.to_datetime('5/1/2021')]})
def played_injured(x):
played_after = df_B[(df_B['id'] == x['id']) & (df_B['played_on'] > x['injury_date'])]# respective to the time for the actual row
played_before = df_B[(df_B['id'] == x['id']) & (df_B['played_on'] <= x['injury_date'])]
played_after = played_after.groupby('games',as_index=False).first()
played_before = played_before.groupby('games',as_index=False).first()
#print(~played_after.empty)
return pd.Series({"id":x['id'], "played after injury": len(played_after.index) > 0,"played before injury": len(played_before.index) > 0, "games played after":played_after['games'].tolist(), "games played before":played_before['games'].tolist()})
pd.merge(df_A, df_A.apply(played_injured, axis=1), on=["id"]).groupby('texts', as_index=False).first()
CodePudding user response:
I liked your code, and couldn't find anything to make it more efficient. The problem with the output was that it was merging on "id" only, which meant that the first id=1
function output was duplicated. To solve this, instead of merging the output with the original DataFrame, you can return the initial columns too.
def played_injured(x):
print(x)
played_after = df_B[(df_B['id'] == x['id']) & (df_B['played_on'] > x['injury_date'])]# respective to the time for the actual row
print(played_after)
played_before = df_B[(df_B['id'] == x['id']) & (df_B['played_on'] <= x['injury_date'])]
print(played_before)
played_after = played_after.groupby('games', as_index=False).first()
print(played_after)
played_before = played_before.groupby('games', as_index=False).first()
print(played_before)
#print(~played_after.empty)
return pd.Series({"id": x['id'], "texts": x["texts"], "injury_date": x["injury_date"],
"played after injury": len(played_after.index) > 0,
"played before injury": len(played_before.index) > 0,
"games played after":played_after['games'].tolist(),
"games played before":played_before['games'].tolist()})
df_A.apply(played_injured, axis=1).groupby("texts", as_index=False).first()