Strating from the dataset below :
User_id | temp_Id | Date |
---|---|---|
A | A1 | 01/02/2020 |
A | A1 | 01/02/2020 |
A | A2 | 05/02/2020 |
A | A3 | 07/02/2020 |
B | B1 | 01/01/2020 |
C | C1 | 05/01/2020 |
C | C2 | 06/02/2020 |
D | D1 | 07/01/2020 |
D | D2 | 08/02/2020 |
D | D3 | 09/02/2020 |
how would I get a new dataframe with two columns such that for each user_id
(1st columns) the second column will provide the difference between the last and previous connection date ?
PS: for B the results will be zero
It worked using nested loop over all user_id but I am looking for the most pythonic way to do that. I guess pandas.groupby would be the best thing to do.
Thank you
Edit:Date time format is Day/Month/Year
The desired output would be like this
User_id | Delta |
---|---|
A | 2 |
B | 0 |
C | 31 |
D | 1 |
CodePudding user response:
assuming the year is 2020 if your date has a year as part of it, then the first line in computing date2, is to be modified
calculate a date as yyyy-mm-dd in a new column in the df
using groupby and transform to calculate the difference between current and previous as days
updated line one, now that the date has a year as part of it
###df['date2'] = df['Date'].str.split('/').apply(lambda x: '2020-' x[1] '-' x[0] ).astype('datetime64')
# date is in dd/mm/yyyy format
df['date2'] = df['Date'].str.split('/').apply(lambda x: x[2] '-' x[1] '-' x[0] ).astype('datetime64')
df['diff']=df.groupby('User_id')['date2'].transform(lambda y: y-y.shift(1))
User_id temp_Id Date date2 diff
0 A A1 01/02 2020-02-01 NaT
1 A A1 01/02 2020-02-01 0 days
2 A A2 05/02 2020-02-05 4 days
3 A A3 07/02 2020-02-07 2 days
4 B B1 01/01 2020-01-01 NaT
5 C C1 05/01 2020-01-05 NaT
6 C C2 06/02 2020-02-06 32 days
7 D D1 07/01 2020-01-07 NaT
8 D D2 08/02 2020-02-08 32 days
9 D D3 09/02 2020-02-09 1 days
obtaining the last value, after question is updated with more details
df.groupby(['User_id'])['diff'].last().to_frame()
User_id diff
A 2 days
B NaT
C 32 days
D 1 days
CodePudding user response:
the second column will provide the difference between the last and previous connection date.
Can you give me the sample desired output?
I added the answer
def get_result(list_day):
list_day = sorted(list_day)
if list_day == 1:
return 0
return (list_day[-1] - list_day[-2]).days
group_df = df.groupby("User_id")["Date"].apply(lambda x: get_result(list(x)))
group_df