Home > Back-end >  Check previous unique values based on time intervals
Check previous unique values based on time intervals

Time:08-05

I have a df of fruit purchases sorted by person, time. For each person, I want to check if they purchased the same fruit within the last 5 minutes? How could I create this column?

   name  fruit  time   purchased_same_fruit_within5minutes
0  Amy   apple 10:00      False
1  Amy   pear  10:04      False
2  Amy   apple 10:06      False (10:00 apple purchase was 6 minutes ago)
3  Amy   pear  10.07      True (pear at 10:04)
4  Ben   ...

CodePudding user response:

I recreated your dataset using:

df = pd.DataFrame({'name':['A','A','A','A','B'],
                   'fruit':['a','p','a','p','p'],
                   'time':['10:00','10:04','10:06','10:07','10:08']})

Transformed the time column into a datetime format:

df['time'] = pd.to_datetime(df['time'])

And finally using timedelta,s we can calculate the interval and whether the condition is met after grouping by:

df['diff'] = df.groupby(['name', 'fruit'])['time'].diff().fillna(pd.Timedelta(seconds=0))

df = df.assign(purchased_same_fruit_within5minutes = np.where(
    (df['diff'] > pd.Timedelta(0,'m')) & (df['diff'] <= pd.Timedelta(5,'m')),
    True,False)).drop(columns='diff')

Outputting:

  name fruit                time  purchased_same_fruit_within5minutes
0    A     a 2022-08-04 10:00:00                                False
1    A     p 2022-08-04 10:04:00                                False
2    A     a 2022-08-04 10:06:00                                False
3    A     p 2022-08-04 10:07:00                                 True
4    B     p 2022-08-04 10:08:00                                False

CodePudding user response:

This is easy as soon as you convert your time values into a format you can do arithmetic calculations on. You may have to look into pd.to_datetime and look into Celius' answer. For illustration, I use simple integers:

import pandas as pd

df = pd.DataFrame({'name': ['Amy', 'Amy', 'Amy', 'Amy', 'Ben'],
              'fruit': ['apple', 'pear', 'apple', 'pear', 'apple'],
              'time': [10*60, 10*60 4, 10*60 6, 10*60 7, 10*60]})

# init additional col
df['5min'] = None

for (name, fruit), df_small in df.groupby(['name', 'fruit']):
    # make sure times are sorted
    df_small = df_small.sort_values('time')
    # calc delta t
    length_interval = df_small['time'].diff()
    # compare with reference time
    ser_bool = length_interval <= 5
    # save results
    df.loc[ser_bool.index, '5min'] = ser_bool;

Or if you do not shy away from pandas' black magic you can simply do:

df = df.sort_values(['name', 'fruit', 'time'])
df['5min'] = df.groupby(['name', 'fruit'])['time'].diff() <= 5

In that case, please figure out yourself why it works.

CodePudding user response:

df['time'] = pd.to_datetime(df['time'])
df
###
  name  fruit                time
0  Amy  apple 2022-08-04 10:00:00
1  Amy   pear 2022-08-04 10:04:00
2  Amy  apple 2022-08-04 10:06:00
3  Amy   pear 2022-08-04 10:07:00
4  Ben  apple 2022-08-04 10:09:00
5  Ben  apple 2022-08-04 10:11:00
6  Ben   pear 2022-08-04 10:18:00
7  Ben   pear 2022-08-04 10:31:00
df['psfw5'] = (df.set_index('time').groupby(['name','fruit'])['fruit'].apply(lambda x: x.rolling('5T').count()) > 1).values
df
###
  name  fruit                time  psfw5
0  Amy  apple 2022-08-04 10:00:00  False
1  Amy   pear 2022-08-04 10:04:00  False
2  Amy  apple 2022-08-04 10:06:00  False
3  Amy   pear 2022-08-04 10:07:00   True
4  Ben  apple 2022-08-04 10:09:00  False
5  Ben  apple 2022-08-04 10:11:00   True
6  Ben   pear 2022-08-04 10:18:00  False
7  Ben   pear 2022-08-04 10:31:00  False
  • Related