I have a dataframe called df_1:
   start_time  end_time
0           0        30
1          77       107
2         220       250
3         268       298
4         391       421
I would like to know how many rows of another dataframe, df_2, fall between each start_time and end_time of df_1. df_2 looks like this:
   time
0   100
1   392
2   399
So, the result would look like this:
   start_time  end_time  count
0           0        30      0
1          77       107      1
2         220       250      0
3         268       298      0
4         391       421      2
Any idea how this could be done?
CodePudding user response:
There are plenty of useful built-in functions for this kind of task. Here I used Series.between (documented in the pandas API reference). The implementation would look like this:
import pandas as pd

df1 = pd.DataFrame({'Start': [0, 77, 391],
                    'End': [30, 107, 421]})
df2 = pd.DataFrame({'time': [100, 392, 399]})

df1['Count'] = df1.apply(
    lambda row: df2.time.between(row['Start'], row['End']).sum(),
    axis=1)
df1
Output:
   Start  End  Count
0      0   30      0
1     77  107      1
2    391  421      2
CodePudding user response:
Off the top of my head,

df_1["count"] = df_1.apply(lambda x: df_2.time.between(x["start_time"], x["end_time"]).sum(), axis=1)

should work (positional access like x[0] on a row Series is deprecated, so use the column labels). Note that each row scans all of df_2, so this is O(len(df_1) * len(df_2)).
Output:
   start_time  end_time  count
0           0        30      0
1          77       107      1
2         220       250      0
3         268       298      0
4         391       421      2
CodePudding user response:
There are a number of options for this; the first involves IntervalIndex:
import pandas as pd
from collections import Counter

intervals = pd.IntervalIndex.from_arrays(df1.start_time, df1.end_time,
                                         closed='both')
df1.index = intervals
filtered = intervals[intervals.get_indexer(df2.time)]
counts = Counter(filtered)
df1.loc[counts.keys(), 'count'] = [*counts.values()]
df1.fillna(0, downcast='infer').reset_index(drop=True)
   start_time  end_time  count
0           0        30      0
1          77       107      1
2         220       250      0
3         268       298      0
4         391       421      2
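One caveat with the IntervalIndex approach: get_indexer returns -1 for any time that falls inside no interval, and position -1 would silently index the last interval and inflate its count. A guarded variant that drops the -1s and replaces the Counter with np.bincount (the extra time 999 is made up here just to show the case):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'start_time': [0, 77, 220, 268, 391],
                    'end_time': [30, 107, 250, 298, 421]})
df2 = pd.DataFrame({'time': [100, 392, 399, 999]})  # 999 matches no interval

intervals = pd.IntervalIndex.from_arrays(df1.start_time, df1.end_time,
                                         closed='both')
idx = intervals.get_indexer(df2.time)  # -1 marks "no containing interval"
# Drop the -1s before counting; otherwise -1 would be treated as a
# position and select the last interval.
df1['count'] = np.bincount(idx[idx >= 0], minlength=len(df1))
```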
The other option involves conditional_join from pyjanitor, which abstracts non-equi (inequality) joins:
# pip install pyjanitor
import pandas as pd
import janitor

(df1
 .conditional_join(
     df2,
     ('start_time', 'time', '<='),
     ('end_time', 'time', '>='),
     how='left')
 .groupby(['start_time', 'end_time'], as_index=False)
 .agg(count=('time', 'count'))
)
   start_time  end_time  count
0           0        30      0
1          77       107      1
2         220       250      0
3         268       298      0
4         391       421      2
The IntervalIndex option should be faster though, as the end goal is an aggregation (the intervalIndex option skips creating a dataframe)
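There is also a join-free route worth mentioning: if the times are sorted, two binary searches per interval give the same counts in roughly O((n + m) log m) via numpy.searchsorted. A sketch under that assumption, using the question's data:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'start_time': [0, 77, 220, 268, 391],
                    'end_time': [30, 107, 250, 298, 421]})
df2 = pd.DataFrame({'time': [100, 392, 399]})

# Sort the times once, then for each interval count values with
# start <= t <= end as the gap between two binary-search positions.
t = np.sort(df2['time'].to_numpy())
lo = np.searchsorted(t, df1['start_time'].to_numpy(), side='left')
hi = np.searchsorted(t, df1['end_time'].to_numpy(), side='right')
df1['count'] = hi - lo
```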