Date interval average Python pandas-CodePudding

This is my dataframe:

ID number	Date purchase
1	2022-05-01
1	2021-03-03
1	2020-01-03
2	2019-01-03
2	2018-01-03

I want to get a horizontal dataframe with alle the dates in seperate columns per ID number.

So like this:

ID number	Date 1	Date 2	Date 3
1	2022-05-01	2021-03-03	2020-01-03
2	2019-01-03	2018-01-03

After I did this I want to calculate the difference between these dates.

CodePudding user response：

First step is GroupBy.cumcount with DataFrame.pivot:

df['Date purchase'] = pd.to_datetime(df['Date purchase'])
df1 = (df.sort_values(by=['ID number', 'Date purchase'], ascending=[True, False])
         .assign(g=lambda x: x.groupby('ID number').cumcount())
         .pivot('ID number','g','Date purchase')
         .rename(columns = lambda x: f'Date {x   1}'))
print (df1)
g             Date 1     Date 2     Date 3
ID number                                 
1         2022-05-01 2021-03-03 2020-01-03
2         2019-01-03 2018-01-03        NaT

Then for differencies between columns use DataFrame.diff:

df2 = df1.diff(-1, axis=1)
print (df2)
g           Date 1   Date 2 Date 3
ID number                         
1         424 days 425 days    NaT
2         365 days      NaT    NaT

If need averages:

df3 = df1.apply(pd.Series.mean, axis=1).reset_index(name='Avg Dates').rename_axis(None, axis=1)
print (df3)
   ID number           Avg Dates
0          1 2021-03-02 16:00:00
1          2 2018-07-04 12:00:00

CodePudding user response：

Could you do something like this?

def format_dataframe(df):
    """
    Function formats the dataframe to the following:
    | ID number| Date 1         | Date 2         | Date 3         |
    | -------- | -------------- | -------------- | -------------- |
    | 1        | 2022-05-01     | 2021-03-03     | 2020-01-03     |
    | 2        | 2019-01-03     | 2018-01-03     |                |
    """
    df = df.sort_values(by=['ID number', 'Date purchase'])
    df = df.drop_duplicates(subset=['ID number'], keep='first')
    df = df.drop(columns=['Date purchase'])
    df = df.rename(columns={'ID number': 'ID number', 'Date 1': 'Date 1', 'Date 2': 'Date 2', 'Date 3': 'Date 3'})
    return df