How to calculate difference between max and min date for each user-CodePudding

How to calculate the difference between max and min date for each user

I have tried going like -

df['diff'] = (df['purschase_date'].max() - df['purschase_date'].min()).dt.days

But it has calculated across all rows and for not this particular user, how do I go to calculating it by user_id

Here is how dataframe looks like

CodePudding user response：

try this :

import pandas as pd
d = {'user_id': [2432730,2432730, 2432731,2432731],
     'purchase_date': ["2020-09-09", "2020-08-09","2020-09-09","2020-09-19"]}

df = pd.DataFrame(data=d)
df['purchase_date']=pd.to_datetime(df['purchase_date'])

Input :

max_date=df[['user_id','purchase_date']].groupby(by='user_id').max().reset_index().rename(columns={'purchase_date':'max_date'})
min_date=df[['user_id','purchase_date']].groupby(by='user_id').min().reset_index().rename(columns={'purchase_date':'min_date'})
min_date=min_date.join(max_date['max_date'])
min_date['diff']=(min_date['max_date']-min_date['min_date']).dt.days
min_date

Output :

CodePudding user response：

You just need a groupby user_id to calculate the difference

df.groupby('user_id')['purchase_date'].max() - df.groupby('user_id')['purchase_date'].min()

This will create a series and it cannot be directly assigned to the df as there are only two users so you will have two rows. So you need to assign the results back to the dataframe

CodePudding user response：

create new dataframe grouped per id with named cols for min and max values of dates, later merge with original.

data input:

    import numpy as np
    import pandas as pd
    
    df = pd.DataFrame({
        "user_id": (np.random.randint(10000,10004,15, dtype="int32")),
        "purchase_date": (pd.date_range(start='2022-01-01', periods=15, freq='8H')),
        "C": pd.Series(1, index=list(range(15)), dtype="float32"),
        "D": np.array([5] * 15, dtype="int32"),
        "E": "foo",
    })
    df['purchase_date'] = pd.to_datetime(df['purchase_date']).dt.normalize()

    

# Solution


df_grouped = df.groupby(['user_id']).agg(
    date_min=('purchase_date', 'min'),
    date_max=('purchase_date', 'max'))\
    .reset_index()
df_grouped['diff']=(df_grouped['date_max']-df_grouped['date_min']).dt.days
df1 = pd.merge(df, df_grouped)
df1

Out:

   user_id purchase_date    C  D    E   date_min   date_max  diff
0     10001    2022-01-01  1.0  5  foo 2022-01-01 2022-01-04     3
1     10001    2022-01-02  1.0  5  foo 2022-01-01 2022-01-04     3
2     10001    2022-01-03  1.0  5  foo 2022-01-01 2022-01-04     3
3     10001    2022-01-04  1.0  5  foo 2022-01-01 2022-01-04     3
4     10000    2022-01-01  1.0  5  foo 2022-01-01 2022-01-04     3
5     10000    2022-01-02  1.0  5  foo 2022-01-01 2022-01-04     3
6     10000    2022-01-03  1.0  5  foo 2022-01-01 2022-01-04     3
7     10000    2022-01-04  1.0  5  foo 2022-01-01 2022-01-04     3
8     10002    2022-01-01  1.0  5  foo 2022-01-01 2022-01-05     4
9     10002    2022-01-02  1.0  5  foo 2022-01-01 2022-01-05     4
10    10002    2022-01-03  1.0  5  foo 2022-01-01 2022-01-05     4
11    10002    2022-01-05  1.0  5  foo 2022-01-01 2022-01-05     4
12    10002    2022-01-05  1.0  5  foo 2022-01-01 2022-01-05     4
13    10003    2022-01-04  1.0  5  foo 2022-01-04 2022-01-05     1
14    10003    2022-01-05  1.0  5  foo 2022-01-04 2022-01-05     1