Home > database >  datetime manipulation through pandas groupby
datetime manipulation through pandas groupby

Time:06-06

Strating from the dataset below :

User_id temp_Id Date
A A1 01/02/2020
A A1 01/02/2020
A A2 05/02/2020
A A3 07/02/2020
B B1 01/01/2020
C C1 05/01/2020
C C2 06/02/2020
D D1 07/01/2020
D D2 08/02/2020
D D3 09/02/2020

how would I get a new dataframe with two columns such that for each user_id (1st columns) the second column will provide the difference between the last and previous connection date ? PS: for B the results will be zero

It worked using nested loop over all user_id but I am looking for the most pythonic way to do that. I guess pandas.groupby would be the best thing to do.

Thank you

Edit:Date time format is Day/Month/Year

The desired output would be like this

User_id Delta
A 2
B 0
C 31
D 1

CodePudding user response:

assuming the year is 2020 if your date has a year as part of it, then the first line in computing date2, is to be modified

calculate a date as yyyy-mm-dd in a new column in the df

using groupby and transform to calculate the difference between current and previous as days

updated line one, now that the date has a year as part of it

###df['date2'] = df['Date'].str.split('/').apply(lambda x: '2020-' x[1] '-' x[0] ).astype('datetime64')

# date is in dd/mm/yyyy format
df['date2'] = df['Date'].str.split('/').apply(lambda x: x[2] '-' x[1] '-' x[0] ).astype('datetime64')

df['diff']=df.groupby('User_id')['date2'].transform(lambda y: y-y.shift(1))

    User_id temp_Id Date    date2       diff
0   A         A1    01/02   2020-02-01  NaT
1   A         A1    01/02   2020-02-01  0 days
2   A         A2    05/02   2020-02-05  4 days
3   A         A3    07/02   2020-02-07  2 days
4   B         B1    01/01   2020-01-01  NaT
5   C         C1    05/01   2020-01-05  NaT
6   C         C2    06/02   2020-02-06  32 days
7   D         D1    07/01   2020-01-07  NaT
8   D         D2    08/02   2020-02-08  32 days
9   D         D3    09/02   2020-02-09  1 days

obtaining the last value, after question is updated with more details

df.groupby(['User_id'])['diff'].last().to_frame()
User_id diff
A       2 days
B       NaT
C       32 days
D       1 days

CodePudding user response:

the second column will provide the difference between the last and previous connection date.

Can you give me the sample desired output?

I added the answer

def get_result(list_day):
    list_day = sorted(list_day)
    if list_day == 1:
        return 0
    return (list_day[-1] - list_day[-2]).days

group_df = df.groupby("User_id")["Date"].apply(lambda x: get_result(list(x)))
group_df
  • Related