Home > OS >  How do I get the first and second closest datetime in date by a specific column?
How do I get the first and second closest datetime in date by a specific column?

Time:09-10

I need to create a new df that takes both the most recent date and the second most recent date for each store in the Date Time column; however, not all stores have a previous visit dates so some may need to return Nan.

df

Store Date Time
100 2022-03-24 19:04:00
100 2022-05-05 10:29:00
100 2022-07-29 11:58:00
101 2022-03-15 08:56:00
102 2022-04-15 10:21:00

Resulting new df:

Store Previous Visit Date Most Recent Visit Date
100 2022-05-05 10:29:00 2022-07-29 11:58:00
101 Nan 2022-03-15 08:56:00
102 Nan 2022-04-15 10:21:00

I tried first to sort values by store than index. df = df.set_index('Store').sort_values(['Store', 'Date Time'])

I don't know how to integrate if store is unique take Most Recent Visit Date, if it has multiple entries pick the closet date = df['Most Recent Visit Date'] , and 2nd closest visit date = df['Previous Visit Date']. I might be making this more complicated than it needs to be.

CodePudding user response:

You can use groupby() with shift() here: (df.pipe(lambda d: d.set_index('Store')).groupby('Store')['Date Time'].shift().rename('Previous Visit Date') ) The pipe step is just to set the index so we can groupby by store. This outputs: Store 100 NaT 100 2022-03-24 100 2022-05-05 101 2022-03-15 102 2022-04-15 Name: Previous Visit Date, dtype: datetime64[ns] So, you can easily add this as a column in your original df.

CodePudding user response:

You can group the dataframe by store, then sort the Date Time values for each group, and take the most recent and second recent values for the timestamp:

(
    df
    .groupby('Store')['Date Time']
    .agg(lambda x: dict(zip(['Most Recent Visit Date', 'Previous Visit Date'],
                            x.sort_values(ascending=False)[:2])))
    .apply(pd.Series)
    .reset_index()
)

OUTPUT

   Store Most Recent Visit Date Previous Visit Date
0    100    2022-07-29 11:58:00 2022-05-05 10:29:00
1    101    2022-03-15 08:56:00                 NaT
2    102    2022-04-15 10:21:00                 NaT

CodePudding user response:

You can sort the dataframe by Date Time ascending first then group by Store column to get the last two recent dates.

df['Date Time'] = pd.to_datetime(df['Date Time'])
df = df.sort_values(['Store', 'Date Time'])

out = (df.set_index(df.groupby('Store').cumcount(ascending=False))
       .groupby('Store')
       .apply(lambda g: g['Date Time'].iloc[-2:])
       .unstack().reset_index()
       .rename(columns={0: 'Most Recent Visit Date', 1: 'Previous Visit Date'}))
print(df)

   Store           Date Time
0    100 2022-03-24 19:04:00
1    100 2022-05-05 10:29:00
2    100 2022-07-29 11:58:00
3    101 2022-03-15 08:56:00
4    102 2022-04-15 10:21:00

print(out)

   Store Most Recent Visit Date Previous Visit Date
0    100    2022-07-29 11:58:00 2022-05-05 10:29:00
1    101    2022-03-15 08:56:00                 NaT
2    102    2022-04-15 10:21:00                 NaT
  • Related