I have a data frame that looks like this
data = [['A', 0.20], ['B',0.25], ['C',0.11], ['D',0.30], ['E',0.29]]
df = pd.DataFrame(data, columns=['col1', 'col2'])
Col1 is a primary key (each row has a unique value)
The max of col2 is 1 and the min is 0. I want to find the number of datapoint in ranges 0-.30 (both 0 and 0.30 are included), 0-.29, 0-.28, and so on till 0-.01. I can use pd.cut, but the lower limit is not fixed. My lower limit is always 0. Can someone help?
CodePudding user response:
One option using numpy broadcasting:
step = 0.01
up = np.arange(0, 0.3 step, step)
out = pd.Series((df['col2'].to_numpy()[:,None] <= up).sum(axis=0), index=up)
Output:
0.00 0
0.01 0
0.02 0
0.03 0
0.04 0
0.05 0
0.06 0
0.07 0
0.08 0
0.09 0
0.10 0
0.11 1
0.12 1
0.13 1
0.14 1
0.15 1
0.16 1
0.17 1
0.18 1
0.19 1
0.20 2
0.21 2
0.22 2
0.23 2
0.24 2
0.25 3
0.26 3
0.27 3
0.28 3
0.29 4
0.30 5
dtype: int64
With pandas.cut
and cumsum
:
step = 0.01
up = np.arange(0, 0.3 step, step)
(pd.cut(df['col2'], up, labels=up[1:].round(2))
.value_counts(sort=False).cumsum()
)
Output:
0.01 0
0.02 0
0.03 0
0.04 0
0.05 0
0.06 0
0.07 0
0.08 0
0.09 0
0.1 0
0.11 1
0.12 1
0.13 1
0.14 1
0.15 1
0.16 1
0.17 1
0.18 1
0.19 1
0.2 2
0.21 2
0.22 2
0.23 2
0.24 2
0.25 3
0.26 3
0.27 3
0.28 3
0.29 4
0.3 5
Name: col2, dtype: int64