Home > database >  Time since first ever occurrence in Pandas
Time since first ever occurrence in Pandas

Time:08-21

I have the following data frame in Pandas:

df = pd.DataFrame({
     'ID': [1,2,1,1,2,3,1,3,3,3,2],
     'date': ['2021-04-28','2022-05-21','2011-03-01','2021-11-28','1992-12-01','1999-10-28','2022-01-12','2019-02-28','2001-03-28','2022-01-01','2009-05-28']
})

I want to produce a column time since first occur that is the time passed in days since their first occurrence.

Here is what I did:

df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df.sort_values(by=['ID', 'date'], ascending = [True, False], inplace=True)

and I got the sorted data frame

    ID  date
6   1   2022-01-12
3   1   2021-11-28
0   1   2021-04-28
2   1   2011-03-01
1   2   2022-05-21
10  2   2009-05-28
4   2   1992-12-01
9   3   2022-01-01
7   3   2019-02-28
8   3   2001-03-28
5   3   1999-10-28

so the output should look like

    ID  date       time since first occur
6   1   2022-01-12 3970
3   1   2021-11-28 3925
0   1   2021-04-28 3711
2   1   2011-03-01 0
1   2   2022-05-21 10763
10  2   2009-05-28 6022
4   2   1992-12-01 0
9   3   2022-01-01 8101
7   3   2019-02-28 7063
8   3   2001-03-28 517
5   3   1999-10-28 0

Thanks in advance for helping.

CodePudding user response:

After sorting the dataframe, you can get the difference between date and minimal date in group

df['time since first occur'] = (df['date'] - df.groupby('ID')['date'].transform('min')).dt.days
print(df)

    ID       date  time since first occur
6    1 2022-01-12                    3970
3    1 2021-11-28                    3925
0    1 2021-04-28                    3711
2    1 2011-03-01                       0
1    2 2022-05-21                   10763
10   2 2009-05-28                    6022
4    2 1992-12-01                       0
9    3 2022-01-01                    8101
7    3 2019-02-28                    7063
8    3 2001-03-28                     517
5    3 1999-10-28                       0
  • Related