I have two Pandas event dataframes. The first holds longer events, the second shorter events. Each event is defined by its start and end times. Every long event is a "container" of at least one short event, and each short event belongs to exactly one long event interval.
I'm trying to figure out how many short events/intervals fall within each long event.
Minimal example:
# long events
start | end | tag
0.5 | 0.7 | AAA
0.7 | 1.2 | BBB
1.2 | 2 | CCC
# short events
start | end | tag
0.5 | 0.55| a
0.55 | 0.7 | aa
0.7 | 1.2 | b
1.2 | 1.3 | c
1.3 | 1.4 | cc
1.4 | 1.5 | ccc
1.5 | 1.6 | cccc
1.6 | 2 | ccccc
The desired output is:
start | end | tag | count of sub events
0.5 | 0.7 | AAA | 2
0.7 | 1.2 | BBB | 1
1.2 | 2 | CCC | 5
I attempted to use "apply", so I created a function that looked like this:
def count_records_in_time_range(df, start, end):
    return len(df[(df['start'] >= start) & (df['end'] <= end)])
But I'm not sure how to proceed: should I apply this function to the short-events table or the long-events table? How can I compute the desired count per record in one table (the long events dataframe) from the records in the other table (the short events dataframe)?
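For reference, this is how I imagined driving the function row-wise with apply (the frame names long and short are just my placeholders for the two tables above):

```python
import pandas as pd

def count_records_in_time_range(df, start, end):
    # count events whose whole span lies inside [start, end]
    return len(df[(df['start'] >= start) & (df['end'] <= end)])

long = pd.DataFrame({'start': [0.5, 0.7, 1.2],
                     'end': [0.7, 1.2, 2.0],
                     'tag': ['AAA', 'BBB', 'CCC']})
short = pd.DataFrame({'start': [0.5, 0.55, 0.7, 1.2, 1.3, 1.4, 1.5, 1.6],
                      'end': [0.55, 0.7, 1.2, 1.3, 1.4, 1.5, 1.6, 2.0],
                      'tag': ['a', 'aa', 'b', 'c', 'cc', 'ccc', 'cccc', 'ccccc']})

# for each long event, count the rows of the *short* table inside its span
long['count of sub events'] = long.apply(
    lambda row: count_records_in_time_range(short, row['start'], row['end']),
    axis=1)
```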
CodePudding user response:
Merge on start and ffill, then groupby and map back to the long events.
s = shortevents.merge(longevents.drop(columns='end'), how='left', on='start', suffixes=('', '_y')).ffill()
longevents['count of sub events'] = longevents['tag'].map(s.groupby('tag_y').size())
start end tag count of sub events
0 0.5 0.7 AAA 2
1 0.7 1.2 BBB 1
2 1.2 2.0 CCC 5
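A self-contained version of this approach, using the example data and the frame names longevents/shortevents from the snippet above:

```python
import pandas as pd

longevents = pd.DataFrame({'start': [0.5, 0.7, 1.2],
                           'end': [0.7, 1.2, 2.0],
                           'tag': ['AAA', 'BBB', 'CCC']})
shortevents = pd.DataFrame({'start': [0.5, 0.55, 0.7, 1.2, 1.3, 1.4, 1.5, 1.6],
                            'end': [0.55, 0.7, 1.2, 1.3, 1.4, 1.5, 1.6, 2.0],
                            'tag': ['a', 'aa', 'b', 'c', 'cc', 'ccc', 'cccc', 'ccccc']})

# left-merge the long-event tag onto the short events that share a start time,
# then forward-fill so every short event inherits its container's tag
s = shortevents.merge(longevents.drop(columns='end'), how='left',
                      on='start', suffixes=('', '_y')).ffill()

# size of each tag_y group = number of short events per container
longevents['count of sub events'] = longevents['tag'].map(s.groupby('tag_y').size())
```

This relies on each long event starting exactly where one of its short events starts, which holds here because the intervals tile the timeline with no gaps.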
CodePudding user response:
We can use numpy broadcasting:
s, e = long[['start', 'end']].values[None, :].T
long['count'] = ((s <= short['start'].values) & (e >= short['end'].values)).sum(1)
start end tag count
0 0.5 0.7 AAA 2
1 0.7 1.2 BBB 1
2 1.2 2.0 CCC 5
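To make the broadcasting explicit, here is the same two-liner with the intermediate array shapes spelled out in comments (a sketch, assuming the long/short frames from the question):

```python
import pandas as pd

long = pd.DataFrame({'start': [0.5, 0.7, 1.2],
                     'end': [0.7, 1.2, 2.0],
                     'tag': ['AAA', 'BBB', 'CCC']})
short = pd.DataFrame({'start': [0.5, 0.55, 0.7, 1.2, 1.3, 1.4, 1.5, 1.6],
                      'end': [0.55, 0.7, 1.2, 1.3, 1.4, 1.5, 1.6, 2.0]})

# (3, 2) -> (1, 3, 2) -> transposed to (2, 3, 1): s and e are (3, 1) columns
s, e = long[['start', 'end']].values[None, :].T

# (3, 1) compared with (8,) broadcasts to a (3, 8) containment matrix;
# summing over axis 1 counts the short events inside each long event
long['count'] = ((s <= short['start'].values) & (e >= short['end'].values)).sum(1)
```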
CodePudding user response:
Here is a solution using a package called piso, which is built upon pandas.
setup
import pandas as pd

long = pd.DataFrame(
    {
        "start": [0.5, 0.7, 1.2],
        "end": [0.7, 1.2, 2],
        "tag": ["AAA", "BBB", "CCC"],
    }
)
short = pd.DataFrame(
    {
        "start": [0.5, 0.55, 0.7, 1.2, 1.3, 1.4, 1.5, 1.6],
        "end": [0.55, 0.7, 1.2, 1.3, 1.4, 1.5, 1.6, 2],
        "tag": ["a", "aa", "b", "c", "cc", "ccc", "cccc", "ccccc"],
    }
)
solution
Import piso and create an interval index from the long intervals. We'll use it as the index of a Series whose values come from the tag column.
import piso
long_ii = long.set_index(pd.IntervalIndex.from_arrays(long["start"], long["end"]))["tag"]
long_ii
looks like this
(0.5, 0.7] AAA
(0.7, 1.2] BBB
(1.2, 2.0] CCC
Name: tag, dtype: object
If every short interval is contained within a long interval, then it is enough to check whether one set of end points is contained in the intervals of the interval index. Since an interval index is right-closed by default, you'll want to check the "end" points. If we had created the interval index left-closed, we'd check the "start" points instead.
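A small pandas-only check (no piso needed) of why the right end points are the safe choice here:

```python
import pandas as pd

# right-closed by default: (0.5, 0.7], (0.7, 1.2], (1.2, 2.0]
ii = pd.IntervalIndex.from_arrays([0.5, 0.7, 1.2], [0.7, 1.2, 2.0])

# every "end" point of a short event lands in exactly one interval
positions = ii.get_indexer([0.55, 0.7, 1.2, 2.0])

# a "start" point would be wrong: 0.7 starts BBB's sub-events, but as a
# point it belongs to (0.5, 0.7], i.e. AAA's interval
```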
Next we use piso.lookup, which takes a dataframe or series indexed by an interval index and uses it as a lookup table for an array of points. E.g.
containers = piso.lookup(long_ii, short["end"])
containers
looks like this
end
0.55 AAA
0.70 AAA
1.20 BBB
1.30 CCC
1.40 CCC
1.50 CCC
1.60 CCC
2.00 CCC
Name: tag, dtype: object
Then we use pandas.Series.value_counts:
container_counts = containers.value_counts()
container_counts
looks like this
CCC 5
AAA 2
BBB 1
Name: tag, dtype: int64
You can then join this Series to your original dataframe, based on tag to get the desired result.
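That final join could look like the sketch below. Since value_counts leaves the tags in the index, we merge on the index; container_counts is rebuilt here with plain pandas so the snippet runs standalone, standing in for the piso result above:

```python
import pandas as pd

long = pd.DataFrame({'start': [0.5, 0.7, 1.2],
                     'end': [0.7, 1.2, 2.0],
                     'tag': ['AAA', 'BBB', 'CCC']})

# stand-in for container_counts: short events counted per container tag
container_counts = pd.Series([5, 2, 1], index=['CCC', 'AAA', 'BBB'])

# align the counts to the long events via the tag column / Series index
result = long.merge(container_counts.rename('count of sub events'),
                    how='left', left_on='tag', right_index=True)
```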
summary
import piso
long_ii = long.set_index(pd.IntervalIndex.from_arrays(long["start"], long["end"]))["tag"]
piso.lookup(long_ii, short["end"]).value_counts()