Collapse rows in pandas df based on condition

Time:04-07

I have a dataframe (df) and I want to build a function that returns another dataframe (df_target).

I want to look at the gap between one row's stop and the next row's start in df, and if it's under one second I want to collapse the rows. For example, row one stops at 3.1 and row two starts at 3.2, so I want to collapse these two into one row where start is 1.2 and stop is 4.4.

The row after that starts at 5.8, more than one second after 4.4, so it begins a new row. The rows that follow it are each less than one second apart, so the collapsed row should have start 5.8 and stop 14.1.

What is a good method to achieve this in a general function?

import pandas as pd

df = pd.DataFrame({'start': [1.2, 3.2, 5.8, 6.4, 11.8, 16.9],
                   'stop': [3.1, 4.4, 6.2, 11.7, 14.1, 19]})

df_target = pd.DataFrame({'start': [1.2, 5.8, 16.9],
                          'stop': [4.4, 14.1, 19]})

I've started to experiment a bit but haven't quite gotten there. This loop gets the first row:

d=len(log)
for i in range(d-1):
    if log.stop[i] - log.start[i + 1] < -1:
        print(pd.DataFrame({'start':[log.start[0]],
                         'stop':[log.stop[i]]}))
        break

CodePudding user response:

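You can use groupby.agg with a custom group: start a new group whenever the gap between a row's start and the previous row's stop is greater than 1. This assumes the values are initially sorted.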

group = df['start'].sub(df['stop'].shift()).gt(1).cumsum()

df.groupby(group).agg({'start': 'first', 'stop': 'last'})

output:

   start  stop
0    1.2   4.4
1    5.8  14.1
2   16.9  19.0

Intermediate steps:

df['diff'] = df['start'].sub(df['stop'].shift())
df['>1'] = df['diff'].gt(1)
df['group'] = df['>1'].cumsum()

   start  stop  diff     >1  group
0    1.2   3.1   NaN  False      0
1    3.2   4.4   0.1  False      0
2    5.8   6.2   1.4   True      1
3    6.4  11.7   0.2  False      1
4   11.8  14.1   0.1  False      1
5   16.9  19.0   2.8   True      2
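
Since the question asks for a general function, the steps above could be wrapped up roughly like this. This is only a sketch: the name collapse_rows and the max_gap parameter are made up for illustration, and the sort_values call is an addition so that the "initially sorted" assumption holds for unsorted input.

```python
import pandas as pd


def collapse_rows(df, max_gap=1.0):
    """Merge consecutive rows whose gap to the previous row is at most max_gap.

    collapse_rows and max_gap are hypothetical names, not part of pandas.
    """
    # Sort by start so each row can be compared with the one before it.
    ordered = df.sort_values("start").reset_index(drop=True)
    # Gap between a row's start and the previous row's stop (NaN for row 0).
    gap = ordered["start"] - ordered["stop"].shift()
    # A gap above max_gap opens a new group; cumsum turns the flags into ids.
    group = gap.gt(max_gap).cumsum()
    # Keep the first start and the last stop of every group.
    return (
        ordered.groupby(group)
        .agg(start=("start", "first"), stop=("stop", "last"))
        .reset_index(drop=True)
    )


df = pd.DataFrame({"start": [1.2, 3.2, 5.8, 6.4, 11.8, 16.9],
                   "stop": [3.1, 4.4, 6.2, 11.7, 14.1, 19]})
result = collapse_rows(df)
print(result)
```

With the sample df this should give the same three rows as df_target. Note that taking the 'last' stop assumes no row is fully contained in an earlier one; if that can happen, aggregating stop with 'max' would probably be the safer choice.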