I have this data in my CSV:
A_PERSON,B_PERSON,DATE_TIME,DURATION
190,390,'2020-04-20 12:44:36',323
282,811,'2020-04-06 11:12:24',25
495,414,'2020-04-20 11:22:13',11
827,158,'2020-04-30 13:27:22',22
827,158,'2020-04-30 13:27:44',15
I am trying to group rows with the same A_PERSON and B_PERSON where the difference between DATE_TIME + DURATION of one row and the DATE_TIME of the next row is less than 2 seconds. For example, the last two rows have the same A_PERSON and B_PERSON, and the difference between DATE_TIME + DURATION of the second-to-last row (13:27:22 + 22 s = 13:27:44) and the DATE_TIME of the last row (13:27:44) is 0 seconds, which is less than 2 seconds, so the last row should be merged into the one before it; all other rows should be displayed as they are.
Desired output:
A_PERSON,B_PERSON,DATE_TIME,DURATION
190,390,'2020-04-20 12:44:36',323
282,811,'2020-04-06 11:12:24',25
495,414,'2020-04-20 11:22:13',11
827,158,'2020-04-30 13:27:22',37
So far I have tried this:
import datetime

import pandas as pd

def merger(dataframe: pd.DataFrame) -> pd.DataFrame:
    dataframe['DATE_TIME'] = pd.to_datetime(dataframe['DATE_TIME'])
    # seconds since the Unix epoch, so DURATION can be added as a plain number
    dataframe['epoch'] = (dataframe['DATE_TIME'] - datetime.datetime(1970, 1, 1)).dt.total_seconds()
    mask = dataframe[dataframe['epoch'] < dataframe['epoch'] + 1 + dataframe['DURATION']]
    grouped = mask.groupby(["A_PERSON", "B_PERSON"])['DURATION'].sum()
    print(grouped)
    return grouped
In this code the groupby on A_PERSON, B_PERSON is working, but the mask is not (the condition compares each row's epoch with its own epoch + DURATION, so it never filters anything).
Sample 2
A_PERSON,B_PERSON,DATE_TIME,DURATION
441785807190,4299330390,'2020-04-20 12:44:36',323
441785808282,4238900811,'2020-04-06 11:12:24',25
14244012495,3104405414,'2020-04-20 11:22:13',11
96897940827,3139578158,'2020-04-30 13:27:02',32
96897940827,3139578158,'2020-04-30 13:27:34',16
Desired output for sample 2
A_PERSON,B_PERSON,DATE_TIME,DURATION
441785807190,4299330390,'2020-04-20 12:44:36',323
441785808282,4238900811,'2020-04-06 11:12:24',25
14244012495,3104405414,'2020-04-20 11:22:13',11
96897940827,3139578158,'2020-04-30 13:27:02',48
Sample 3
A_PERSON,B_PERSON,DATE_TIME,DURATION
441785807190,4299330390,'2020-04-20 12:44:36',323
96897940827,3139578158,'2020-04-30 13:27:00',33
441785808282,4238900811,'2020-04-06 11:12:24',25
14244012495,3104405414,'2020-04-20 11:22:13',11
96897940827,3139578158,'2020-04-30 13:27:34',16
Desired output for sample 3
A_PERSON,B_PERSON,DATE_TIME,DURATION
441785807190,4299330390,'2020-04-20 12:44:36',323
96897940827,3139578158,'2020-04-30 13:27:00',49
441785808282,4238900811,'2020-04-06 11:12:24',25
14244012495,3104405414,'2020-04-20 11:22:13',11
CodePudding user response:
In the first run below (where the fourth row starts at 13:27:32) the difference in the last group is 5 seconds (13:27:59 - 13:27:54 = 5 seconds), so N = 5 is used there. The solution is to add DURATION in seconds to DATE_TIME in a new column add, get the per-group differences of that column with DataFrameGroupBy.diff, compare them with the inverted condition (greater than N), take the cumulative sum of the result to build a new group column g, and finally aggregate with first and sum:
N = 5
dataframe['DATE_TIME'] = pd.to_datetime(dataframe['DATE_TIME'])
dataframe['add'] = dataframe['DATE_TIME'] + pd.to_timedelta(dataframe['DURATION'], unit='s')
f = lambda x: x.diff().dt.total_seconds().gt(N).cumsum()
dataframe['g'] = dataframe.groupby(["A_PERSON", "B_PERSON"])['add'].transform(f)
print (dataframe)
A_PERSON B_PERSON DATE_TIME DURATION add g
0 190 390 2020-04-20 12:44:36 323 2020-04-20 12:49:59 0
1 282 811 2020-04-06 11:12:24 25 2020-04-06 11:12:49 0
2 495 414 2020-04-20 11:22:13 11 2020-04-20 11:22:24 0
3 827 158 2020-04-30 13:27:32 22 2020-04-30 13:27:54 0
4 827 158 2020-04-30 13:27:44 15 2020-04-30 13:27:59 0
dataframe = (dataframe.groupby(["A_PERSON", "B_PERSON", 'g'])
.agg({'DATE_TIME':'first', 'DURATION':'sum'})
.droplevel(-1)
.reset_index())
print (dataframe)
A_PERSON B_PERSON DATE_TIME DURATION
0 190 390 2020-04-20 12:44:36 323
1 282 811 2020-04-06 11:12:24 25
2 495 414 2020-04-20 11:22:13 11
3 827 158 2020-04-30 13:27:32 37
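The key idea is that comparing the per-group differences with gt(N) is True exactly where a new block starts, and the cumulative sum of those booleans turns the flags into block ids. A minimal standalone illustration of this trick (toy numbers, not the question's data):
import pandas as pd

s = pd.Series([0, 1, 2, 10, 11, 30])  # toy timestamps in seconds
gaps = s.diff()                       # NaN, 1, 1, 8, 1, 19
new_block = gaps.gt(5)                # True where the gap exceeds N = 5
print(new_block.cumsum().tolist())    # [0, 0, 0, 1, 1, 2] -> block ids
Rows that share a block id are then collapsed by the agg step, keeping the first DATE_TIME and summing DURATION.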
If you instead need to compare DATE_TIME against the per-group shifted add column, the solution (with the new data, and N = 2 as asked) is:
N = 2
dataframe['DATE_TIME'] = pd.to_datetime(dataframe['DATE_TIME'])
dataframe['add'] = dataframe['DATE_TIME'] + pd.to_timedelta(dataframe['DURATION'], unit='s')
dataframe['diff'] = dataframe['DATE_TIME'].sub(dataframe.groupby(["A_PERSON", "B_PERSON"])['add'].shift()).dt.total_seconds().gt(N)
dataframe['g'] = dataframe.groupby(["A_PERSON", "B_PERSON"])['diff'].cumsum()
print (dataframe)
A_PERSON B_PERSON DATE_TIME DURATION add \
0 190 390 2020-04-20 12:44:36 323 2020-04-20 12:49:59
1 282 811 2020-04-06 11:12:24 25 2020-04-06 11:12:49
2 495 414 2020-04-20 11:22:13 11 2020-04-20 11:22:24
3 827 158 2020-04-30 13:27:22 22 2020-04-30 13:27:44
4 827 158 2020-04-30 13:27:44 15 2020-04-30 13:27:59
diff g
0 False 0
1 False 0
2 False 0
3 False 0
4 False 0
dataframe = (dataframe.groupby(["A_PERSON", "B_PERSON", 'g'])
.agg({'DATE_TIME':'first', 'DURATION':'sum'})
.droplevel(-1)
.reset_index())
print (dataframe)
A_PERSON B_PERSON DATE_TIME DURATION
0 190 390 2020-04-20 12:44:36 323
1 282 811 2020-04-06 11:12:24 25
2 495 414 2020-04-20 11:22:13 11
3 827 158 2020-04-30 13:27:22 37
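As a quick sanity check of the 2-second rule on the question's data (plain timestamp arithmetic, just for illustration):
import pandas as pd

# last group: end of the first call vs. start of the second call
end_first = pd.Timestamp('2020-04-30 13:27:22') + pd.Timedelta(seconds=22)
start_second = pd.Timestamp('2020-04-30 13:27:44')
gap = (start_second - end_first).total_seconds()
print(gap, gap > 2)  # 0.0 False -> same group, DURATION summed to 37
Because the gap is not greater than N = 2, diff stays False, g stays 0 for both rows, and the aggregation merges them.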
Tested with the 3rd sample:
N = 2
dataframe['DATE_TIME'] = pd.to_datetime(dataframe['DATE_TIME'])
dataframe['add'] = dataframe['DATE_TIME'] + pd.to_timedelta(dataframe['DURATION'], unit='s')
dataframe['diff'] = dataframe['DATE_TIME'].sub(dataframe.groupby(["A_PERSON", "B_PERSON"])['add'].shift()).dt.total_seconds().gt(N)
dataframe['g'] = dataframe.groupby(["A_PERSON", "B_PERSON"])['diff'].cumsum()
print (dataframe)
A_PERSON B_PERSON DATE_TIME DURATION add \
0 441785807190 4299330390 2020-04-20 12:44:36 323 2020-04-20 12:49:59
1 96897940827 3139578158 2020-04-30 13:27:00 33 2020-04-30 13:27:33
2 441785808282 4238900811 2020-04-06 11:12:24 25 2020-04-06 11:12:49
3 14244012495 3104405414 2020-04-20 11:22:13 11 2020-04-20 11:22:24
4 96897940827 3139578158 2020-04-30 13:27:34 16 2020-04-30 13:27:50
diff g
0 False 0
1 False 0
2 False 0
3 False 0
4 False 0
dataframe = (dataframe.groupby(["A_PERSON", "B_PERSON", 'g'])
.agg({'DATE_TIME':'first', 'DURATION':'sum'})
.droplevel(-1)
.reset_index())
print (dataframe)
A_PERSON B_PERSON DATE_TIME DURATION
0 14244012495 3104405414 2020-04-20 11:22:13 11
1 96897940827 3139578158 2020-04-30 13:27:00 49
2 441785807190 4299330390 2020-04-20 12:44:36 323
3 441785808282 4238900811 2020-04-06 11:12:24 25
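For reuse, the shift-based approach can be wrapped in a single function (a sketch under my own naming; merge_close_calls, the n parameter, and calls.csv are illustrative, not from the answer):
import pandas as pd

def merge_close_calls(df: pd.DataFrame, n: int = 2) -> pd.DataFrame:
    # merge consecutive rows of the same (A_PERSON, B_PERSON) pair whose
    # start time is within n seconds of the previous row's end time
    df = df.copy()
    df['DATE_TIME'] = pd.to_datetime(df['DATE_TIME'])
    end = df['DATE_TIME'] + pd.to_timedelta(df['DURATION'], unit='s')
    prev_end = end.groupby([df['A_PERSON'], df['B_PERSON']]).shift()
    new_block = df['DATE_TIME'].sub(prev_end).dt.total_seconds().gt(n)
    g = new_block.groupby([df['A_PERSON'], df['B_PERSON']]).cumsum()
    return (df.groupby(['A_PERSON', 'B_PERSON', g])
              .agg({'DATE_TIME': 'first', 'DURATION': 'sum'})
              .droplevel(-1)
              .reset_index())

# quotechar="'" because the sample CSV quotes datetimes with single quotes
print(merge_close_calls(pd.read_csv('calls.csv', quotechar="'")))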