This is my dataframe:
ID number | Date purchase |
---|---|
1 | 2022-05-01 |
1 | 2021-03-03 |
1 | 2020-01-03 |
2 | 2019-01-03 |
2 | 2018-01-03 |
I want to get a horizontal dataframe with alle the dates in seperate columns per ID number.
So like this:
ID number | Date 1 | Date 2 | Date 3 |
---|---|---|---|
1 | 2022-05-01 | 2021-03-03 | 2020-01-03 |
2 | 2019-01-03 | 2018-01-03 |
After I did this I want to calculate the difference between these dates.
CodePudding user response:
First step is GroupBy.cumcount
with DataFrame.pivot
:
df['Date purchase'] = pd.to_datetime(df['Date purchase'])
df1 = (df.sort_values(by=['ID number', 'Date purchase'], ascending=[True, False])
.assign(g=lambda x: x.groupby('ID number').cumcount())
.pivot('ID number','g','Date purchase')
.rename(columns = lambda x: f'Date {x 1}'))
print (df1)
g Date 1 Date 2 Date 3
ID number
1 2022-05-01 2021-03-03 2020-01-03
2 2019-01-03 2018-01-03 NaT
Then for differencies between columns use DataFrame.diff
:
df2 = df1.diff(-1, axis=1)
print (df2)
g Date 1 Date 2 Date 3
ID number
1 424 days 425 days NaT
2 365 days NaT NaT
If need averages:
df3 = df1.apply(pd.Series.mean, axis=1).reset_index(name='Avg Dates').rename_axis(None, axis=1)
print (df3)
ID number Avg Dates
0 1 2021-03-02 16:00:00
1 2 2018-07-04 12:00:00
CodePudding user response:
Could you do something like this?
def format_dataframe(df):
"""
Function formats the dataframe to the following:
| ID number| Date 1 | Date 2 | Date 3 |
| -------- | -------------- | -------------- | -------------- |
| 1 | 2022-05-01 | 2021-03-03 | 2020-01-03 |
| 2 | 2019-01-03 | 2018-01-03 | |
"""
df = df.sort_values(by=['ID number', 'Date purchase'])
df = df.drop_duplicates(subset=['ID number'], keep='first')
df = df.drop(columns=['Date purchase'])
df = df.rename(columns={'ID number': 'ID number', 'Date 1': 'Date 1', 'Date 2': 'Date 2', 'Date 3': 'Date 3'})
return df