I have a data frame and I would like to create bins based on the following logic. The threshold for binning is 5.
The first bin (bin1) starts with the first marker, m1. We then compute the difference in position with the next marker, m2: Position(m2) - Position(m1) = 0.5 - 0 = 0.5. Because the difference is < the threshold, m2 belongs to bin1. We then move to the next marker, m3, and repeat the process: Position(m3) - Position(m1) = 0.6, and because 0.6 < threshold, m3 belongs to bin1.
We continue the same operation until the difference from the first marker of the current bin becomes greater than the threshold. Because Position(m6) - Position(m1) = 7, which is > threshold, m6 doesn't belong to bin1 and becomes the first marker of bin2. We repeat the same process: Position(m7) - Position(m6) = 1.4, which is < threshold, so m7 belongs to bin2. I hope you get the idea.
The expected output for this example is:
bin1 = ['m1','m2','m3','m4','m5']
bin2 = ['m6','m7']
bin3 = ['m8','m9','m10']
There are lots of questions about binning and the answers refer to qcut and cut. But I am not sure they work for my case, or I'm not sure how to apply that to my case. Thank you in advance for your time.
import pandas as pd
df = pd.DataFrame({'Marker': ['m1','m2','m3','m4','m5','m6','m7','m8','m9','m10'],
                   'Position': [0,0.5,0.6,2,5,7,8.4,15,16,17]})
CodePudding user response:
Use floordiv:
df['Cluster'] = df['Position'].floordiv(5).astype(int).factorize()[0] + 1
Output:
>>> df
  Marker  Position  Cluster
0     m1       0.0        1
1     m2       0.5        1
2     m3       0.6        1
3     m4       2.0        1
4     m5       5.0        2
5     m6       7.0        2
6     m7       8.4        2
7     m8      15.0        3
8     m9      16.0        3
9    m10      17.0        3
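
Note that floordiv cuts at fixed multiples of 5, so m5 (Position 5.0) lands in cluster 2, whereas the expected output in the question keeps m5 in bin1. If the bins really have to be anchored on the first marker of each bin, as the question describes, a plain loop may be closer to what is wanted. The following is only a minimal sketch under some assumptions: the helper name anchored_bins and the Bin column are made up for illustration, the positions are assumed to be sorted in ascending order, and a difference exactly equal to the threshold is assumed to stay in the current bin (which is what the expected output implies for m5).

```python
import pandas as pd

df = pd.DataFrame({
    'Marker': ['m1', 'm2', 'm3', 'm4', 'm5', 'm6', 'm7', 'm8', 'm9', 'm10'],
    'Position': [0, 0.5, 0.6, 2, 5, 7, 8.4, 15, 16, 17],
})

def anchored_bins(positions, threshold):
    """Hypothetical helper: return a 1-based bin label for each position.

    A new bin is opened whenever the distance to the first position of
    the current bin is strictly greater than the threshold.
    Assumes the positions are sorted in ascending order.
    """
    labels = []
    anchor = None   # position of the first marker in the current bin
    current = 0     # label of the current bin
    for pos in positions:
        if anchor is None or pos - anchor > threshold:
            anchor = pos      # this marker starts a new bin
            current += 1
        labels.append(current)
    return labels

df['Bin'] = anchored_bins(df['Position'], threshold=5)

# Collect the marker names per bin, e.g. to compare with the expected lists
bins = df.groupby('Bin')['Marker'].apply(list).to_dict()
print(bins)
```

With the sample data this should group m1 to m5, then m6 and m7, then m8 to m10. The loop is sequential on purpose: each bin boundary depends on where the previous bin started, which is why a fixed-width approach such as cut or floordiv does not reproduce it in general.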