I need to create a new df that takes both the most recent date and the second most recent date for each store in the Date Time column; however, not all stores have a previous visit dates so some may need to return Nan.
df
Store | Date Time |
---|---|
100 | 2022-03-24 19:04:00 |
100 | 2022-05-05 10:29:00 |
100 | 2022-07-29 11:58:00 |
101 | 2022-03-15 08:56:00 |
102 | 2022-04-15 10:21:00 |
Resulting new df:
Store | Previous Visit Date | Most Recent Visit Date |
---|---|---|
100 | 2022-05-05 10:29:00 | 2022-07-29 11:58:00 |
101 | Nan | 2022-03-15 08:56:00 |
102 | Nan | 2022-04-15 10:21:00 |
I tried first to sort values by store than index. df = df.set_index('Store').sort_values(['Store', 'Date Time'])
I don't know how to integrate if store is unique take Most Recent Visit Date, if it has multiple entries pick the closet date = df['Most Recent Visit Date'] , and 2nd closest visit date = df['Previous Visit Date']. I might be making this more complicated than it needs to be.
CodePudding user response:
You can use groupby()
with shift()
here:
(df.pipe(lambda d: d.set_index('Store')).groupby('Store')['Date Time'].shift().rename('Previous Visit Date') )
The pipe
step is just to set the index so we can groupby
by store. This outputs:
Store
100 NaT
100 2022-03-24
100 2022-05-05
101 2022-03-15
102 2022-04-15
Name: Previous Visit Date, dtype: datetime64[ns]
So, you can easily add this as a column in your original df.
CodePudding user response:
You can group the dataframe by store
, then sort the Date Time values for each group, and take the most recent and second recent values for the timestamp:
(
df
.groupby('Store')['Date Time']
.agg(lambda x: dict(zip(['Most Recent Visit Date', 'Previous Visit Date'],
x.sort_values(ascending=False)[:2])))
.apply(pd.Series)
.reset_index()
)
OUTPUT
Store Most Recent Visit Date Previous Visit Date
0 100 2022-07-29 11:58:00 2022-05-05 10:29:00
1 101 2022-03-15 08:56:00 NaT
2 102 2022-04-15 10:21:00 NaT
CodePudding user response:
You can sort the dataframe by Date Time
ascending first then group by Store
column to get the last two recent dates.
df['Date Time'] = pd.to_datetime(df['Date Time'])
df = df.sort_values(['Store', 'Date Time'])
out = (df.set_index(df.groupby('Store').cumcount(ascending=False))
.groupby('Store')
.apply(lambda g: g['Date Time'].iloc[-2:])
.unstack().reset_index()
.rename(columns={0: 'Most Recent Visit Date', 1: 'Previous Visit Date'}))
print(df)
Store Date Time
0 100 2022-03-24 19:04:00
1 100 2022-05-05 10:29:00
2 100 2022-07-29 11:58:00
3 101 2022-03-15 08:56:00
4 102 2022-04-15 10:21:00
print(out)
Store Most Recent Visit Date Previous Visit Date
0 100 2022-07-29 11:58:00 2022-05-05 10:29:00
1 101 2022-03-15 08:56:00 NaT
2 102 2022-04-15 10:21:00 NaT