I have a df of fruit purchases sorted by person, time. For each person, I want to check if they purchased the same fruit within the last 5 minutes? How could I create this column?
name fruit time purchased_same_fruit_within5minutes
0 Amy apple 10:00 False
1 Amy pear 10:04 False
2 Amy apple 10:06 False (10:00 apple purchase was 6 minutes ago)
3 Amy pear 10.07 True (pear at 10:04)
4 Ben ...
CodePudding user response:
I recreated your dataset using:
df = pd.DataFrame({'name':['A','A','A','A','B'],
'fruit':['a','p','a','p','p'],
'time':['10:00','10:04','10:06','10:07','10:08']})
Transformed the time
column into a datetime format:
df['time'] = pd.to_datetime(df['time'])
And finally using timedelta,s we can calculate the interval and whether the condition is met after grouping by:
df['diff'] = df.groupby(['name', 'fruit'])['time'].diff().fillna(pd.Timedelta(seconds=0))
df = df.assign(purchased_same_fruit_within5minutes = np.where(
(df['diff'] > pd.Timedelta(0,'m')) & (df['diff'] <= pd.Timedelta(5,'m')),
True,False)).drop(columns='diff')
Outputting:
name fruit time purchased_same_fruit_within5minutes
0 A a 2022-08-04 10:00:00 False
1 A p 2022-08-04 10:04:00 False
2 A a 2022-08-04 10:06:00 False
3 A p 2022-08-04 10:07:00 True
4 B p 2022-08-04 10:08:00 False
CodePudding user response:
This is easy as soon as you convert your time values into a format you can do arithmetic calculations on. You may have to look into pd.to_datetime
and look into Celius' answer. For illustration, I use simple integers:
import pandas as pd
df = pd.DataFrame({'name': ['Amy', 'Amy', 'Amy', 'Amy', 'Ben'],
'fruit': ['apple', 'pear', 'apple', 'pear', 'apple'],
'time': [10*60, 10*60 4, 10*60 6, 10*60 7, 10*60]})
# init additional col
df['5min'] = None
for (name, fruit), df_small in df.groupby(['name', 'fruit']):
# make sure times are sorted
df_small = df_small.sort_values('time')
# calc delta t
length_interval = df_small['time'].diff()
# compare with reference time
ser_bool = length_interval <= 5
# save results
df.loc[ser_bool.index, '5min'] = ser_bool;
Or if you do not shy away from pandas' black magic you can simply do:
df = df.sort_values(['name', 'fruit', 'time'])
df['5min'] = df.groupby(['name', 'fruit'])['time'].diff() <= 5
In that case, please figure out yourself why it works.
CodePudding user response:
df['time'] = pd.to_datetime(df['time'])
df
###
name fruit time
0 Amy apple 2022-08-04 10:00:00
1 Amy pear 2022-08-04 10:04:00
2 Amy apple 2022-08-04 10:06:00
3 Amy pear 2022-08-04 10:07:00
4 Ben apple 2022-08-04 10:09:00
5 Ben apple 2022-08-04 10:11:00
6 Ben pear 2022-08-04 10:18:00
7 Ben pear 2022-08-04 10:31:00
df['psfw5'] = (df.set_index('time').groupby(['name','fruit'])['fruit'].apply(lambda x: x.rolling('5T').count()) > 1).values
df
###
name fruit time psfw5
0 Amy apple 2022-08-04 10:00:00 False
1 Amy pear 2022-08-04 10:04:00 False
2 Amy apple 2022-08-04 10:06:00 False
3 Amy pear 2022-08-04 10:07:00 True
4 Ben apple 2022-08-04 10:09:00 False
5 Ben apple 2022-08-04 10:11:00 True
6 Ben pear 2022-08-04 10:18:00 False
7 Ben pear 2022-08-04 10:31:00 False