I have two Pandas event dataframes. The first holds longer events, the second shorter events. Each event is defined by its start and end times. Every long event is a "container" of at least one short event, and each short event belongs to exactly one long event interval.
I'm trying to figure out how many short events/intervals fall within each long event.
Minimal example:
# long events
start | end | tag
0.5 | 0.7 | AAA
0.7 | 1.2 | BBB
1.2 | 2 | CCC
# short events
start | end | tag
0.5 | 0.55| a
0.55 | 0.7 | aa
0.7 | 1.2 | b
1.2 | 1.3 | c
1.3 | 1.4 | cc
1.4 | 1.5 | ccc
1.5 | 1.6 | cccc
1.6 | 2 | ccccc
The desired output is:
start | end | tag | count of sub events
0.5 | 0.7 | AAA | 2
0.7 | 1.2 | BBB | 1
1.2 | 2 | CCC | 5
I attempted to use "apply", so I created a function that looked like this:
def count_records_in_time_range(df, start, end):
    return len(df[(df['start'] >= start) & (df['end'] <= end)])
But I'm not sure how to proceed: should I apply this function to the short-events table or the long-events table? How can I compute the desired count per record in one table (the long events dataframe) from the records in the other table (the short events dataframe)?
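For reference, this is how I imagined driving the function row-wise with apply (the frame names long and short are just my placeholders for the two tables above):

```python
import pandas as pd

def count_records_in_time_range(df, start, end):
    # count events whose whole span lies inside [start, end]
    return len(df[(df['start'] >= start) & (df['end'] <= end)])

long = pd.DataFrame({'start': [0.5, 0.7, 1.2],
                     'end': [0.7, 1.2, 2.0],
                     'tag': ['AAA', 'BBB', 'CCC']})
short = pd.DataFrame({'start': [0.5, 0.55, 0.7, 1.2, 1.3, 1.4, 1.5, 1.6],
                      'end': [0.55, 0.7, 1.2, 1.3, 1.4, 1.5, 1.6, 2.0],
                      'tag': ['a', 'aa', 'b', 'c', 'cc', 'ccc', 'cccc', 'ccccc']})

# for each long event, count the rows of the *short* table inside its span
long['count of sub events'] = long.apply(
    lambda row: count_records_in_time_range(short, row['start'], row['end']),
    axis=1)
```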
CodePudding user response:
Merge on start and ffill, then groupby and map back to the long events.
s = shortevents.merge(longevents.drop(columns='end'), how='left', on='start', suffixes=('', '_y')).ffill()
longevents['count of sub events'] = longevents['tag'].map(s.groupby('tag_y').size())
start end tag count of sub events
0 0.5 0.7 AAA 2
1 0.7 1.2 BBB 1
2 1.2 2.0 CCC 5
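A self-contained version of this approach, using the example data and the frame names longevents/shortevents from the snippet above:

```python
import pandas as pd

longevents = pd.DataFrame({'start': [0.5, 0.7, 1.2],
                           'end': [0.7, 1.2, 2.0],
                           'tag': ['AAA', 'BBB', 'CCC']})
shortevents = pd.DataFrame({'start': [0.5, 0.55, 0.7, 1.2, 1.3, 1.4, 1.5, 1.6],
                            'end': [0.55, 0.7, 1.2, 1.3, 1.4, 1.5, 1.6, 2.0],
                            'tag': ['a', 'aa', 'b', 'c', 'cc', 'ccc', 'cccc', 'ccccc']})

# left-merge the long-event tag onto the short events that share a start time,
# then forward-fill so every short event inherits its container's tag
s = shortevents.merge(longevents.drop(columns='end'), how='left',
                      on='start', suffixes=('', '_y')).ffill()

# size of each tag_y group = number of short events per container
longevents['count of sub events'] = longevents['tag'].map(s.groupby('tag_y').size())
```

This relies on each long event starting exactly where one of its short events starts, which holds here because the intervals tile the timeline with no gaps.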
CodePudding user response:
We can use numpy broadcasting:
s, e = long[['start', 'end']].values[None, :].T
long['count'] = ((s <= short['start'].values) & (e >= short['end'].values)).sum(1)
start end tag count
0 0.5 0.7 AAA 2
1 0.7 1.2 BBB 1
2 1.2 2.0 CCC 5
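To make the broadcasting explicit, here is the same two-liner with the intermediate array shapes spelled out in comments (a sketch, assuming the long/short frames from the question):

```python
import pandas as pd

long = pd.DataFrame({'start': [0.5, 0.7, 1.2],
                     'end': [0.7, 1.2, 2.0],
                     'tag': ['AAA', 'BBB', 'CCC']})
short = pd.DataFrame({'start': [0.5, 0.55, 0.7, 1.2, 1.3, 1.4, 1.5, 1.6],
                      'end': [0.55, 0.7, 1.2, 1.3, 1.4, 1.5, 1.6, 2.0]})

# (3, 2) -> (1, 3, 2) -> transposed to (2, 3, 1): s and e are (3, 1) columns
s, e = long[['start', 'end']].values[None, :].T

# (3, 1) compared with (8,) broadcasts to a (3, 8) containment matrix;
# summing over axis 1 counts the short events inside each long event
long['count'] = ((s <= short['start'].values) & (e >= short['end'].values)).sum(1)
```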
CodePudding user response:
Here is a solution using a package called piso, which is built upon pandas.
setup
import pandas as pd

long = pd.DataFrame(
    {
        "start": [0.5, 0.7, 1.2],
        "end": [0.7, 1.2, 2],
        "tag": ["AAA", "BBB", "CCC"],
    }
)
short = pd.DataFrame(
    {
        "start": [0.5, 0.55, 0.7, 1.2, 1.3, 1.4, 1.5, 1.6],
        "end": [0.55, 0.7, 1.2, 1.3, 1.4, 1.5, 1.6, 2],
        "tag": ["a", "aa", "b", "c", "cc", "ccc", "cccc", "ccccc"],
    }
)
solution
Import piso and create an interval index from the long intervals. We'll use it as the index of a Series whose values come from the tag column.
import piso
long_ii = long.set_index(pd.IntervalIndex.from_arrays(long["start"], long["end"]))["tag"]
long_ii
looks like this
(0.5, 0.7] AAA
(0.7, 1.2] BBB
(1.2, 2.0] CCC
Name: tag, dtype: object
If every short interval is contained within a long interval, then it is enough to check whether one set of end points is contained in the intervals of the interval index. Since an interval index is right-closed by default, you'll want to check the "end" points. If we had created the interval index left-closed, we'd check the "start" points instead.
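A small pandas-only check (no piso needed) of why the right end points are the safe choice here:

```python
import pandas as pd

# right-closed by default: (0.5, 0.7], (0.7, 1.2], (1.2, 2.0]
ii = pd.IntervalIndex.from_arrays([0.5, 0.7, 1.2], [0.7, 1.2, 2.0])

# every "end" point of a short event lands in exactly one interval
positions = ii.get_indexer([0.55, 0.7, 1.2, 2.0])

# a "start" point would be wrong: 0.7 starts BBB's sub-events, but as a
# point it belongs to (0.5, 0.7], i.e. AAA's interval
```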
Next we use piso.lookup, which takes a dataframe or series indexed by an interval index and uses it as a lookup table for an array of points. E.g.
containers = piso.lookup(long_ii, short["end"])
containers
looks like this
end
0.55 AAA
0.70 AAA
1.20 BBB
1.30 CCC
1.40 CCC
1.50 CCC
1.60 CCC
2.00 CCC
Name: tag, dtype: object
Then we use pandas.Series.value_counts:
container_counts = containers.value_counts()
container_counts
looks like this
CCC 5
AAA 2
BBB 1
Name: tag, dtype: int64
You can then join this Series to your original dataframe, based on tag to get the desired result.
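That final join could look like the sketch below. Since value_counts leaves the tags in the index, we merge on the index; container_counts is rebuilt here with plain pandas so the snippet runs standalone, standing in for the piso result above:

```python
import pandas as pd

long = pd.DataFrame({'start': [0.5, 0.7, 1.2],
                     'end': [0.7, 1.2, 2.0],
                     'tag': ['AAA', 'BBB', 'CCC']})

# stand-in for container_counts: short events counted per container tag
container_counts = pd.Series([5, 2, 1], index=['CCC', 'AAA', 'BBB'])

# align the counts to the long events via the tag column / Series index
result = long.merge(container_counts.rename('count of sub events'),
                    how='left', left_on='tag', right_index=True)
```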
summary
import piso
long_ii = long.set_index(pd.IntervalIndex.from_arrays(long["start"], long["end"]))["tag"]
piso.lookup(long_ii, short["end"]).value_counts()