I have a dataframe (df) and I want to build a function that returns another dataframe (df_target).
I want to look at the gap between one row's stop and the next row's start in df, and if it's under one second I want to collapse the rows. For example, row one stops at 3.1 and row two starts at 3.2, so I want to collapse these two into one row where start is 1.2 and stop is 4.4.
The row after that starts at 5.8, more than one second after 4.4, so it should begin a new row. The rows that follow it each start less than a second after the previous stop, so the new row should have start 5.8 and stop 14.1.
What is a good method to achieve this in a general function?
df = pd.DataFrame({'start': [1.2, 3.2, 5.8, 6.4, 11.8, 16.9],
                   'stop': [3.1, 4.4, 6.2, 11.7, 14.1, 19]})
df_target = pd.DataFrame({'start': [1.2, 5.8, 16.9],
                          'stop': [4.4, 14.1, 19]})
I've started to experiment a bit but haven't quite gotten there. This loop only gets the first row:
d = len(log)
for i in range(d - 1):
    # a gap of more than one second before the next row ends the first block
    if log.stop[i] - log.start[i + 1] < -1:
        print(pd.DataFrame({'start': [log.start[0]],
                            'stop': [log.stop[i]]}))
        break
CodePudding user response:
You can use groupby.agg with a custom group. This assumes the values are initially sorted.
group = df['start'].sub(df['stop'].shift()).gt(1).cumsum()
df.groupby(group).agg({'start': 'first', 'stop': 'last'})
output:
   start  stop
0    1.2   4.4
1    5.8  14.1
2   16.9  19.0
Intermediate steps:
df['diff'] = df['start'].sub(df['stop'].shift())
df['>1'] = df['diff'].gt(1)
df['group'] = df['>1'].cumsum()
   start  stop  diff     >1  group
0    1.2   3.1   NaN  False      0
1    3.2   4.4   0.1  False      0
2    5.8   6.2   1.4   True      1
3    6.4  11.7   0.2  False      1
4   11.8  14.1   0.1  False      1
5   16.9  19.0   2.8   True      2
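Since the question asks for a general function, the steps above could be wrapped roughly like this. This is only a sketch: the name collapse_intervals and the max_gap parameter are made up for illustration, and it assumes the intervals do not nest inside one another (otherwise taking the 'last' stop may not be the largest stop in the group).

```python
import pandas as pd


def collapse_intervals(df, max_gap=1.0):
    """Merge consecutive rows whose start is at most max_gap after the previous stop.

    Hypothetical helper; expects 'start' and 'stop' columns.
    """
    # Sort so each row is compared with its chronological predecessor.
    df = df.sort_values("start").reset_index(drop=True)
    # Gap between this row's start and the previous row's stop (NaN for the first row).
    gap = df["start"].sub(df["stop"].shift())
    # A new group begins whenever the gap exceeds max_gap.
    # NaN > max_gap is False, so the first row lands in group 0.
    group = gap.gt(max_gap).cumsum().rename("group")
    return (
        df.groupby(group)
        .agg(start=("start", "first"), stop=("stop", "last"))
        .reset_index(drop=True)
    )


df = pd.DataFrame({'start': [1.2, 3.2, 5.8, 6.4, 11.8, 16.9],
                   'stop': [3.1, 4.4, 6.2, 11.7, 14.1, 19]})
result = collapse_intervals(df)
print(result)
```

For the example df this should give the same three rows as df_target. Sorting inside the function removes the "initially sorted" requirement at the cost of resetting the original index.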