I have large pandas dataframe; sample shown below :
I want to evenly distribute the "Good" flag between the ranges; so that each score range has equal number of goods.
Score Good
100 0
100 0
100 0
300 0
400 0
400 0
600 1
600 1
600 0
650 0
650 0
650 1
700 1
770 1
770 1
800 0
890 1
890 1
Sample Output:
bins Goods
100 - 600 2
650-700 2
770-800 2
> 890 2
I tried using pd.cut
and pd.qcut
but didn't able to figure this out.
CodePudding user response:
# split in defined number of bins considering only Good == 1
N = 4
bins = pd.qcut(df.query('Good == 1')['Score'], N).cat.categories.tolist()
# extend min/max of first/last bins
bins[0] = pd.Interval(df['Score'].min()-0.01, bins[0].right)
bins[-1] = pd.Interval(bins[-1].left, df['Score'].max())
# now we have 4 bins with "equal"* number of Good == 1
# (*in the limit of being able to split equally)
df['bin'] = pd.cut(df['Score'], bins=pd.IntervalIndex(bins))
NB. this doesn't require the data to be sorted.
Example:
Score Good bin
0 100 0 (99.99, 637.5] # bin 1
1 100 0 (99.99, 637.5]
2 100 0 (99.99, 637.5]
3 300 0 (99.99, 637.5]
4 400 0 (99.99, 637.5]
5 400 0 (99.99, 637.5]
6 600 1 (99.99, 637.5]
7 600 1 (99.99, 637.5]
8 600 0 (99.99, 637.5]
9 650 0 (637.5, 735.0] # bin 2
10 650 0 (637.5, 735.0]
11 650 1 (637.5, 735.0]
12 700 1 (637.5, 735.0]
13 770 1 (735.0, 800.0] # bin 3
14 770 1 (735.0, 800.0]
15 800 0 (735.0, 800.0]
16 890 1 (800.0, 890.0] # bin 4
17 890 1 (800.0, 890.0]
Example with N = 2
:
Score Good bin
0 100 0 (99.99, 735.0] # bin 1
1 100 0 (99.99, 735.0]
2 100 0 (99.99, 735.0]
3 300 0 (99.99, 735.0]
4 400 0 (99.99, 735.0]
5 400 0 (99.99, 735.0]
6 600 1 (99.99, 735.0]
7 600 1 (99.99, 735.0]
8 600 0 (99.99, 735.0]
9 650 0 (99.99, 735.0]
10 650 0 (99.99, 735.0]
11 650 1 (99.99, 735.0]
12 700 1 (99.99, 735.0]
13 770 1 (735.0, 890.0] # bin 2
14 770 1 (735.0, 890.0]
15 800 0 (735.0, 890.0]
16 890 1 (735.0, 890.0]
17 890 1 (735.0, 890.0]
With N = 3
(not a divisor of the number of Goods), this tries to split as fairly as possible, leaving I've category with 1 less Good:
Score Good bin
0 100 0 (99.99, 666.667] # bin 1
1 100 0 (99.99, 666.667]
2 100 0 (99.99, 666.667]
3 300 0 (99.99, 666.667]
4 400 0 (99.99, 666.667]
5 400 0 (99.99, 666.667]
6 600 1 (99.99, 666.667]
7 600 1 (99.99, 666.667]
8 600 0 (99.99, 666.667]
9 650 0 (99.99, 666.667]
10 650 0 (99.99, 666.667]
11 650 1 (99.99, 666.667]
12 700 1 (666.667, 770.0] # bin 2
13 770 1 (666.667, 770.0]
14 770 1 (666.667, 770.0]
15 800 0 (770.0, 890.0] # bin 3
16 890 1 (770.0, 890.0] # only 2 Good
17 890 1 (770.0, 890.0]
aggregation
df.groupby(pd.cut(df['Score'], bins=pd.IntervalIndex(bins)))['Good'].sum()
Output:
Score
(99.99, 637.5] 2
(637.5, 735.0] 2
(735.0, 800.0] 2
(800.0, 890.0] 2
Name: Good, dtype: int64
CodePudding user response:
Example
data = [[100, 0], [100, 0], [100, 0], [300, 0], [400, 0], [400, 0], [600, 1], [600, 1], [600, 0],
[650, 0], [650, 0], [650, 1], [700, 1], [770, 1], [770, 1], [800, 0], [890, 1], [890, 1]]
df = pd.DataFrame(data, columns=['Score', 'Good'])
Code
n = 4
grouper = pd.cut(df.sort_values(['Score', 'Good'], ascending=[1, 0])['Good'].cumsum(), bins=n)
mapper = df.groupby(grouper)['Score'].agg([min, max]).astype('str').apply(lambda x: '-'.join(x), axis=1)
df1 = df.groupby(grouper)['Good'].sum()
df1.index = df1.index.map(mapper).astype('str')
df1 = df1.reset_index(name='1').set_axis(['bins', 'Good'], axis=1)
df1.iloc[-1, 0] = '>= ' df1.iloc[-1, 0].split('-')[0]
df1
bins Good
0 100-600 2
1 650-700 2
2 770-800 2
3 >= 890 2
I'll try to find easier way