I have one Pandas dataframe bins
with checkpoints that split a number line into bins. The index is the name of that bin and the value in 'st'
is the starting point of the bin eg:
st
0: 0
1: 4
2: 6
3: 11
4: 17
Thus bin 2 is the space from 6 to 11
I then have another dataframe events
with events that have a location in the number line, eg:
loc
0: 1
1: 3
2: 6
3: 8
4: 10
5: 11
6: 14
7: 18
8: 20
I would like to find for each event in events
, what bin it falls in, and what is the difference to the beginning to the bin, eg:
loc seg diff
0: 1 0 1
1: 3 0 3
2: 6 2 0
3: 8 2 2
4: 10 2 4
5: 11 3 0
6: 14 3 3
7: 18 4 1
8: 20 4 3
I was able to make it work using a for-loop, somewhat optimized by the fact that both dataframes are sorted, but I wanted to know if there is a good way to solve this without loops.
In reality, my dataframe bins
is 100's of rows and events
is 10000's, and the values in st
and loc
are datetime objects, so I would like to find diff
as timedeltas.
CodePudding user response:
In your case do with pd.cut
events['seg'] = pd.cut(events['loc'],st['st'].tolist() [np.inf],labels=st.index,right=False)
events['diff'] = events['loc'] - events['seg'].map(st['st']).astype(int)
events
Out[31]:
loc seg diff
0 1 0 1
1 3 0 3
2 6 2 0
3 8 2 2
4 10 2 4
5 11 3 0
6 14 3 3
7 18 4 1
8 20 4 3
CodePudding user response:
Since your st
column is sorted, you can use numpy.searchsorted() to find indices where elements should be inserted to maintain order.
import numpy as np
events['seg'] = np.searchsorted(bins['st'], events['loc'])-1
loc seg
0 1 0
1 3 0
2 6 1
3 8 2
4 10 2
5 11 2
6 14 3
7 18 4
8 20 4
Then map the newly created indices column to st
Series to get corresponding value
events['diff'] = events['loc'] - events['seg'].map(bins['st'])
loc seg diff
0 1 0 1
1 3 0 3
2 6 1 2
3 8 2 2
4 10 2 4
5 11 2 5
6 14 3 3
7 18 4 1
8 20 4 3