Home > Mobile >  (Pandas) How to find what bin events fall in, where bins are defined by separate dataframe
(Pandas) How to find what bin events fall in, where bins are defined by separate dataframe

Time:04-06

I have one Pandas dataframe bins with checkpoints that split a number line into bins. The index is the name of that bin and the value in 'st' is the starting point of the bin eg:

     st
0:   0
1:   4
2:   6
3:   11
4:   17

Thus bin 2 is the space from 6 to 11

I then have another dataframe events with events that have a location in the number line, eg:

     loc
0:   1
1:   3
2:   6
3:   8
4:   10
5:   11
6:   14
7:   18
8:   20

I would like to find for each event in events, what bin it falls in, and what is the difference to the beginning to the bin, eg:

     loc   seg  diff
0:   1     0    1
1:   3     0    3
2:   6     2    0
3:   8     2    2
4:   10    2    4
5:   11    3    0
6:   14    3    3
7:   18    4    1
8:   20    4    3

I was able to make it work using a for-loop, somewhat optimized by the fact that both dataframes are sorted, but I wanted to know if there is a good way to solve this without loops.

In reality, my dataframe bins is 100's of rows and events is 10000's, and the values in st and loc are datetime objects, so I would like to find diff as timedeltas.

CodePudding user response:

In your case do with pd.cut

events['seg'] = pd.cut(events['loc'],st['st'].tolist() [np.inf],labels=st.index,right=False)
events['diff'] = events['loc'] - events['seg'].map(st['st']).astype(int)

events
Out[31]: 
   loc seg  diff
0    1   0     1
1    3   0     3
2    6   2     0
3    8   2     2
4   10   2     4
5   11   3     0
6   14   3     3
7   18   4     1
8   20   4     3

CodePudding user response:

Since your st column is sorted, you can use numpy.searchsorted() to find indices where elements should be inserted to maintain order.

import numpy as np

events['seg'] = np.searchsorted(bins['st'], events['loc'])-1
   loc  seg
0    1    0
1    3    0
2    6    1
3    8    2
4   10    2
5   11    2
6   14    3
7   18    4
8   20    4

Then map the newly created indices column to st Series to get corresponding value

events['diff'] = events['loc'] - events['seg'].map(bins['st'])
   loc  seg  diff
0    1    0     1
1    3    0     3
2    6    1     2
3    8    2     2
4   10    2     4
5   11    2     5
6   14    3     3
7   18    4     1
8   20    4     3
  • Related