Home > Software design >  Compare dates from two different dataframes to calculate fields
Compare dates from two different dataframes to calculate fields

Time:06-23

I have two data frames and I want to use the date in one dataframe to calculate some fields from another data dataframe. Is there a better way to do this besides how I have done this below? In a toy example I have a dataframe of players and their injury datetime, and in another I have their games with then they played. I want to calculate if the player played a game before and after their injury and return a flag stating they did and the name of the game they played. Then I want to put this result back into the first dataframe.

It should look something like this:

df_A

    id  texts                   injury_date
0   1   text for person 1       2020-01-01
1   2   text for person 2       2020-02-01
2   1   text for person 1 AGAIN 2021-05-01
3   4   text for person 4       2022-01-01

df_B:

    id  games       played_on
0   2   soccer      2019-01-01
1   1   football    2020-02-01
2   1   tennis      2021-05-01

and want to get:

    id  played after injury played before injury    games played after  games played before
0   1   True                False                   [football, tennis]  []
1   2   False               True                    []                  [soccer]
2   1   False               True                    []                  [football, tennis]
3   4   False               False                   []                  []

and merging it like this

texts                       id  injury_date played after injury played before injury        games played after  games played before
0   text for person 1       1   2020-01-01  True                False                       [football, tennis]  []
1   text for person 1 AGAIN 1   2021-05-01  True                False                       []                  [football, tennis]
2   text for person 2       2   2020-02-01  False               True                        []                  [soccer]
3   text for person 4       4   2022-01-01  False               False                       []                  []

Code:

df_A = pd.DataFrame({"id": [1,2,1,4], "texts":['text for person 1', 'text for person 2', 'text for person 1 AGAIN', 'text for person 4'], "injury_date": [pd.to_datetime('1/1/2020'),pd.to_datetime('2/1/2020'),pd.to_datetime('5/1/2021'),pd.to_datetime('1/1/2022')]})

df_B = pd.DataFrame({"id": [2,1,1], "games":['soccer', 'football', 'tennis'], 'played_on': [pd.to_datetime('1/1/2019'),pd.to_datetime('2/1/2020'),pd.to_datetime('5/1/2021')]})

def played_injured(x):
    played_after = df_B[(df_B['id'] == x['id']) & (df_B['played_on'] > x['injury_date'])]# respective to the time for the actual row
    played_before = df_B[(df_B['id'] == x['id']) & (df_B['played_on'] <= x['injury_date'])]
    played_after = played_after.groupby('games',as_index=False).first()
    played_before = played_before.groupby('games',as_index=False).first()
    #print(~played_after.empty)
    return pd.Series({"id":x['id'], "played after injury": len(played_after.index) > 0,"played before injury": len(played_before.index) > 0, "games played after":played_after['games'].tolist(), "games played before":played_before['games'].tolist()})


pd.merge(df_A, df_A.apply(played_injured, axis=1), on=["id"]).groupby('texts', as_index=False).first() 

CodePudding user response:

I liked your code, and couldn't find anything to make it more efficient. The problem with the output was that it was merging on "id" only, which meant that the first id=1 function output was duplicated. To solve this, instead of merging the output with the original DataFrame, you can return the initial columns too.

def played_injured(x):
    print(x)
    played_after = df_B[(df_B['id'] == x['id']) & (df_B['played_on'] > x['injury_date'])]# respective to the time for the actual row
    print(played_after)
    played_before = df_B[(df_B['id'] == x['id']) & (df_B['played_on'] <= x['injury_date'])]
    print(played_before)
    played_after = played_after.groupby('games', as_index=False).first()
    print(played_after)
    played_before = played_before.groupby('games', as_index=False).first()
    print(played_before)
    #print(~played_after.empty)
    return pd.Series({"id": x['id'], "texts": x["texts"], "injury_date": x["injury_date"],
                      "played after injury": len(played_after.index) > 0,
                      "played before injury": len(played_before.index) > 0,
                      "games played after":played_after['games'].tolist(),
                      "games played before":played_before['games'].tolist()})


df_A.apply(played_injured, axis=1).groupby("texts", as_index=False).first()
  • Related