I have a dataframe called df_1:
   start_time  end_time
0           0        30
1          77       107
2         220       250
3         268       298
4         391       421
I would like to know how many rows of another dataframe, df_2, fall between each start_time and end_time of df_1. df_2 looks like this:
   time
0   100
1   392
2   399
So, the result would look like this:
   start_time  end_time  count
0           0        30      0
1          77       107      1
2         220       250      0
3         268       298      0
4         391       421      2
Any idea how this could be done?
CodePudding user response:
There are plenty of useful built-in functions for this kind of task. Here I used Series.between (documented in the pandas API reference). The implementation would look like this:
import pandas as pd

df1 = pd.DataFrame({'Start': [0, 77, 391],
                    'End': [30, 107, 421]})
df2 = pd.DataFrame({'time': [100, 392, 399]})

df1['Count'] = df1.apply(
    lambda row: df2.time.between(row['Start'], row['End']).sum(),
    axis=1)
df1
Output:
   Start  End  Count
0      0   30      0
1     77  107      1
2    391  421      2
CodePudding user response:
Off the top of my head,

df_1["count"] = df_1.apply(lambda x: df_2.time.between(x["start_time"], x["end_time"]).sum(), axis=1)

should work (positional access like x[0] on a row Series is deprecated, so use the column labels). Note that each row scans all of df_2, so this is O(len(df_1) * len(df_2)).
Output:
   start_time  end_time  count
0           0        30      0
1          77       107      1
2         220       250      0
3         268       298      0
4         391       421      2
CodePudding user response:
There are a number of options for this; the first involves IntervalIndex:
import pandas as pd
from collections import Counter

intervals = pd.IntervalIndex.from_arrays(df1.start_time, df1.end_time,
                                         closed='both')
df1.index = intervals
filtered = intervals[intervals.get_indexer(df2.time)]
counts = Counter(filtered)
df1.loc[counts.keys(), 'count'] = [*counts.values()]
df1.fillna(0, downcast='infer').reset_index(drop=True)
   start_time  end_time  count
0           0        30      0
1          77       107      1
2         220       250      0
3         268       298      0
4         391       421      2
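One caveat with the IntervalIndex approach: get_indexer returns -1 for any time that falls inside no interval, and position -1 would silently index the last interval and inflate its count. A guarded variant that drops the -1s and replaces the Counter with np.bincount (the extra time 999 is made up here just to show the case):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'start_time': [0, 77, 220, 268, 391],
                    'end_time': [30, 107, 250, 298, 421]})
df2 = pd.DataFrame({'time': [100, 392, 399, 999]})  # 999 matches no interval

intervals = pd.IntervalIndex.from_arrays(df1.start_time, df1.end_time,
                                         closed='both')
idx = intervals.get_indexer(df2.time)  # -1 marks "no containing interval"
# Drop the -1s before counting; otherwise -1 would be treated as a
# position and select the last interval.
df1['count'] = np.bincount(idx[idx >= 0], minlength=len(df1))
```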
The other option involves conditional_join from pyjanitor, which abstracts non-equi (inequality) joins:
# pip install pyjanitor
import pandas as pd
import janitor

(df1
 .conditional_join(
     df2,
     ('start_time', 'time', '<='),
     ('end_time', 'time', '>='),
     how='left')
 .groupby(['start_time', 'end_time'], as_index=False)
 .agg(count=('time', 'count'))
)
   start_time  end_time  count
0           0        30      0
1          77       107      1
2         220       250      0
3         268       298      0
4         391       421      2
The IntervalIndex option should be faster though, as the end goal is an aggregation (the intervalIndex option skips creating a dataframe)
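There is also a join-free route worth mentioning: if the times are sorted, two binary searches per interval give the same counts in roughly O((n + m) log m) via numpy.searchsorted. A sketch under that assumption, using the question's data:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'start_time': [0, 77, 220, 268, 391],
                    'end_time': [30, 107, 250, 298, 421]})
df2 = pd.DataFrame({'time': [100, 392, 399]})

# Sort the times once, then for each interval count values with
# start <= t <= end as the gap between two binary-search positions.
t = np.sort(df2['time'].to_numpy())
lo = np.searchsorted(t, df1['start_time'].to_numpy(), side='left')
hi = np.searchsorted(t, df1['end_time'].to_numpy(), side='right')
df1['count'] = hi - lo
```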