Pandas - length of connectable pd.Intervals-CodePudding

We start with an interval axis that is divided into bins of length 5. (0,5], (5, 10], ... There is a timestamp column that has some timestamps >= 0. By using pd.cut() the interval bin that corresponds to the timestamp is determined. (e.g. "timestamp" = 3.0 -> "time_bin" = (0,5]).

If there is a time bin that has no corresponding timestamp, it does not show up in the interval column. Thus, there can be interval gaps in the "time_bin" column, e.g., (5,10], (15,20]. (i.e., interval (10,15] is missing // note that the timestamp column is sorted)

The goal is to obtain a column "connected_interval" that indicates whether the current row interval is connected to the previous row interval; connected meaning no interval gaps, i.e., (0,5], (5,10], (10, 15] would be assigned the same integer ID) and a column "conn_interv_length" that indicates for each largest possible connected interval the length of the interval. The interval (0,5], (5,10], (10, 15] would be of length 15.

The initial dataframe has columns "group_id", "timestamp", "time_bin". Columns "connected_interval" & "conn_interv_len" should be computed.

Note: any solution to obtaining the length of populated connected intervals is welcome.

df = pd.DataFrame({"group_id":['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],\
             "timestamp": [0.0, 3.0, 9.0, 24.2, 30.2, 0.0, 136.51, 222.0, 237.0, 252.0],\
             "time_bin": [pd.Interval(0, 5, closed='left'), pd.Interval(0, 5, closed='left'), pd.Interval(5, 10, closed='left'), pd.Interval(20, 25, closed='left'), pd.Interval(30, 35, closed='left'), pd.Interval(0, 5, closed='left'), pd.Interval(135, 140, closed='left'), pd.Interval(220, 225, closed='left'), pd.Interval(235, 240, closed='left'), pd.Interval(250, 255, closed='left')],\
              "connected_interval":[0, 0, 0, 1, 2, 0, 1, 2, 3, 4],\
              "conn_interv_len":[10, 10, 10, 5, 5, 5, 5, 5, 5, 5],\
             })

input with expected output columns:

  group_id  timestamp    time_bin  connected_interval  conn_interv_len
0        A       0.00      [0, 5)                   0               10
1        A       3.00      [0, 5)                   0               10
2        A       9.00     [5, 10)                   0               10
3        A      24.20    [20, 25)                   1                5
4        A      30.20    [30, 35)                   2                5
5        B       0.00      [0, 5)                   0                5
6        B     136.51  [135, 140)                   1                5
7        B     222.00  [220, 225)                   2                5
8        B     237.00  [235, 240)                   3                5
9        B     252.00  [250, 255)                   4                5

CodePudding user response：

IIUC, you can sort the intervals, drop duplicates, extract the left/right bound, create groups based on the match/mismatch of the successive left/right, then merge again the output to the original:

df2 = (df[['group_id', 'time_bin']]
 # extract bounds and sort intervals
 .assign(left=df['time_bin'].array.left,
         right=df['time_bin'].array.right)
 .sort_values(by=['group_id', 'left', 'right'])
 # ensure no duplicates
 .drop_duplicates(['group_id', 'time_bin'])
 # compute connected intervals and connected length
 .assign(connected_interval=lambda d:
          d.groupby('group_id', group_keys=False)
           .apply(lambda g: g['left'].ne(g['right'].shift())
                                     .cumsum().sub(1)),
         conn_interv_len=lambda d:
          (g := d.groupby(['group_id', 'connected_interval']))['right'].transform('max')
          -g['left'].transform('min')
        )
  .drop(columns=['left', 'right'])
)

# merge to restore missing dropped duplicated rows
out = df.merge(df2)

output:

  group_id  timestamp    time_bin  connected_interval  conn_interv_len
0        A       0.00      [0, 5)                   0               10
1        A       3.00      [0, 5)                   0               10
2        A       9.00     [5, 10)                   0               10
3        A      24.20    [20, 25)                   1                5
4        A      30.20    [30, 35)                   2                5
5        B       0.00      [0, 5)                   0                5
6        B     136.51  [135, 140)                   1                5
7        B     222.00  [220, 225)                   2                5
8        B     237.00  [235, 240)                   3                5
9        B     252.00  [250, 255)                   4                5