Pandas: how to merge rows by union of intervals

I'm a bit lost with pandas: I want to merge an interval column by taking the union of intervals where the event is the same, but separately for each annotator. For example, my input would be this:

annotator  event          interval_presence
3          birds          [0,5]
3          birds          [7,9]
3          voices         [1,2]
3          traffic        [1,7]
5          voices         [4,7]
5          voices         [5,10]
5          traffic        [0,1]

Each item in "interval_presence" is a list. I want this as output:

annotator  event          interval_presence
3          birds          [[0,5],[7,9]]
3          voices         [1,2]
3          traffic        [1,7]
5          voices         [4,10]
5          traffic        [0,1]

I've seen that I can use the Python library "piso" to merge intervals, but I don't know how to merge my intervals only within the same event and the same annotator. Do you have any idea how to merge my intervals like this?

CodePudding user response：

This is a good fit for pandas.DataFrame.groupby combined with a small sort-and-sweep helper that merges overlapping intervals:

from ast import literal_eval

# df["interval_presence"] = df["interval_presence"].apply(literal_eval)  # uncomment if the column holds string literals

def merge_overintervals(ser):
    merged = []
    # visit intervals by start; extend the last merged one on overlap
    for start, end in sorted(ser, key=lambda x: x[0]):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

out = df.groupby(["annotator", "event"], as_index=False, sort=False).agg(list)

out["interval_presence"] = out["interval_presence"].apply(merge_overintervals)

Output:

print(out)

   annotator    event interval_presence
0          3    birds  [[0, 5], [7, 9]]
1          3   voices          [[1, 2]]
2          3  traffic          [[1, 7]]
3          5   voices         [[4, 10]]
4          5  traffic          [[0, 1]]

NB: If you want a single-interval result flattened to a plain [start, end] list instead of a nested one, add this:

from itertools import chain

out["interval_presence"]  = [x if len(x)>1 else list(chain(*x)) 
                             for x in out["interval_presence"]]
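
To see what that flattening step does, here is a small sketch on a hand-written list. The values in merged are hypothetical stand-ins for the column contents, not output taken from the DataFrame above.

```python
from itertools import chain

# Hypothetical merged results: one group with two intervals, one group with a single interval
merged = [[[0, 5], [7, 9]], [[4, 10]]]

# Keep multi-interval groups nested; unwrap single-interval groups to [start, end]
flattened = [x if len(x) > 1 else list(chain(*x)) for x in merged]

print(flattened)  # [[[0, 5], [7, 9]], [4, 10]]
```

Note that the column then mixes two shapes (a list of intervals and a bare interval), so any later code has to check the nesting before using a value.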

CodePudding user response：

With the following example data:

import pandas as pd
import piso

data = pd.DataFrame({
    'annotator': [3, 3, 3, 3, 5, 5, 5],
    'event': ['birds', 'birds', 'voices', 'traffic', 'voices', 'voices', 'traffic'],
    'interval_presence': [[0,5], [7,9], [1,2], [1,7], [4,7], [5,10], [0,1]]
})

This code converts the lists into pandas Interval objects:

data['interval_presence'] = data['interval_presence'].apply(lambda x: pd.Interval(*x))
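
One thing to keep in mind with this conversion: pd.Interval is right-closed by default, so [0, 5] becomes (0, 5], whereas the lists in the question read like intervals closed on both ends. A minimal sketch of the difference (variable names are only for illustration):

```python
import pandas as pd

# Default construction is right-closed: (0, 5]
iv = pd.Interval(0, 5)
default_closed = iv.closed          # "right"
contains_left = 0 in iv             # left endpoint is excluded
contains_right = 5 in iv            # right endpoint is included

# Touching right-closed intervals share no point, so they do not overlap
touching_right = pd.Interval(0, 5).overlaps(pd.Interval(5, 7))

# With closed="both" the shared endpoint 5 belongs to both intervals
touching_both = pd.Interval(0, 5, closed="both").overlaps(
    pd.Interval(5, 7, closed="both")
)

print(default_closed, contains_left, contains_right, touching_right, touching_both)
```

For the sample data this makes no difference to the union, but it can matter when two intervals only touch at an endpoint.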

And this groups and then applies piso.union to the intervals:

data = data.groupby(['annotator', 'event'])['interval_presence'] \
    .apply(pd.arrays.IntervalArray) \
    .apply(piso.union) \
    .reset_index()
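
If the result should end up back in the list format used in the question, the intervals can be converted again afterwards. The sketch below assumes each cell holds a pandas IntervalArray with integer endpoints; to_pairs is a hypothetical helper name, and the array is built by hand here so the example runs without piso.

```python
import pandas as pd

# Hand-built stand-in for one group's merged intervals (assumed to be an IntervalArray)
merged = pd.arrays.IntervalArray.from_tuples([(0, 5), (7, 9)])

def to_pairs(intervals):
    """Convert an iterable of pd.Interval into [[left, right], ...].

    Assumes integer endpoints; drop the int() calls for float data.
    """
    return [[int(iv.left), int(iv.right)] for iv in intervals]

pairs = to_pairs(merged)
print(pairs)  # [[0, 5], [7, 9]]
```

Applied to the grouped frame, this would be something like data['interval_presence'].apply(to_pairs).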

Note, however, that piso currently seems to support only intervals that are closed on exactly one side (left or right). There may be a deeper reason for this, or it may simply be out of date. For example, creating the intervals like this instead:

data['interval_presence'] = data['interval_presence'].apply(lambda x: pd.Interval(*x, closed='both'))

results in an AttributeError: 'str' object has no attribute 'closed'.