Distribute data equally in all the bins based on a column : Pandas-CodePudding

I have large pandas dataframe; sample shown below :

I want to evenly distribute the "Good" flag between the ranges; so that each score range has equal number of goods.

Score   Good
100    0
100    0
100    0
300    0
400    0
400    0
600    1
600    1
600    0
650    0
650    0
650    1
700    1
770    1
770    1
800    0
890    1
890    1

Sample Output:

bins       Goods
100 - 600   2
650-700     2
770-800     2
> 890       2

I tried using pd.cut and pd.qcut but didn't able to figure this out.

CodePudding user response：

You can combine qcut and cut:

# split in defined number of bins considering only Good == 1
N = 4
bins = pd.qcut(df.query('Good == 1')['Score'], N).cat.categories.tolist()

# extend min/max of first/last bins
bins[0] = pd.Interval(df['Score'].min()-0.01, bins[0].right)
bins[-1] = pd.Interval(bins[-1].left,  df['Score'].max())

# now we have 4 bins with "equal"* number of Good == 1
# (*in the limit of being able to split equally)
df['bin'] = pd.cut(df['Score'], bins=pd.IntervalIndex(bins))

NB. this doesn't require the data to be sorted.

Example:

    Score  Good             bin
0     100     0  (99.99, 637.5] # bin 1
1     100     0  (99.99, 637.5]
2     100     0  (99.99, 637.5]
3     300     0  (99.99, 637.5]
4     400     0  (99.99, 637.5]
5     400     0  (99.99, 637.5]
6     600     1  (99.99, 637.5]
7     600     1  (99.99, 637.5]
8     600     0  (99.99, 637.5]

9     650     0  (637.5, 735.0] # bin 2
10    650     0  (637.5, 735.0]
11    650     1  (637.5, 735.0]
12    700     1  (637.5, 735.0]

13    770     1  (735.0, 800.0] # bin 3
14    770     1  (735.0, 800.0]
15    800     0  (735.0, 800.0]

16    890     1  (800.0, 890.0] # bin 4
17    890     1  (800.0, 890.0]

Example with N = 2:

    Score  Good             bin
0     100     0  (99.99, 735.0] # bin 1
1     100     0  (99.99, 735.0]
2     100     0  (99.99, 735.0]
3     300     0  (99.99, 735.0]
4     400     0  (99.99, 735.0]
5     400     0  (99.99, 735.0]
6     600     1  (99.99, 735.0]
7     600     1  (99.99, 735.0]
8     600     0  (99.99, 735.0]
9     650     0  (99.99, 735.0]
10    650     0  (99.99, 735.0]
11    650     1  (99.99, 735.0]
12    700     1  (99.99, 735.0]

13    770     1  (735.0, 890.0] # bin 2
14    770     1  (735.0, 890.0]
15    800     0  (735.0, 890.0]
16    890     1  (735.0, 890.0]
17    890     1  (735.0, 890.0]

With N = 3 (not a divisor of the number of Goods), this tries to split as fairly as possible, leaving I've category with 1 less Good:

    Score  Good               bin
0     100     0  (99.99, 666.667] # bin 1
1     100     0  (99.99, 666.667]
2     100     0  (99.99, 666.667]
3     300     0  (99.99, 666.667]
4     400     0  (99.99, 666.667]
5     400     0  (99.99, 666.667]
6     600     1  (99.99, 666.667]
7     600     1  (99.99, 666.667]
8     600     0  (99.99, 666.667]
9     650     0  (99.99, 666.667]
10    650     0  (99.99, 666.667]
11    650     1  (99.99, 666.667]

12    700     1  (666.667, 770.0] # bin 2
13    770     1  (666.667, 770.0]
14    770     1  (666.667, 770.0]

15    800     0    (770.0, 890.0] # bin 3
16    890     1    (770.0, 890.0] # only 2 Good
17    890     1    (770.0, 890.0]

aggregation

df.groupby(pd.cut(df['Score'], bins=pd.IntervalIndex(bins)))['Good'].sum()

Output:

Score
(99.99, 637.5]    2
(637.5, 735.0]    2
(735.0, 800.0]    2
(800.0, 890.0]    2
Name: Good, dtype: int64

CodePudding user response：

Example

data = [[100, 0], [100, 0], [100, 0], [300, 0], [400, 0], [400, 0], [600, 1], [600, 1], [600, 0], 
        [650, 0], [650, 0], [650, 1], [700, 1], [770, 1], [770, 1], [800, 0], [890, 1], [890, 1]]
df = pd.DataFrame(data, columns=['Score', 'Good'])

Code

n = 4
grouper = pd.cut(df.sort_values(['Score', 'Good'], ascending=[1, 0])['Good'].cumsum(), bins=n)
mapper = df.groupby(grouper)['Score'].agg([min, max]).astype('str').apply(lambda x: '-'.join(x), axis=1)
df1 = df.groupby(grouper)['Good'].sum()
df1.index = df1.index.map(mapper).astype('str')
df1 = df1.reset_index(name='1').set_axis(['bins', 'Good'], axis=1)
df1.iloc[-1, 0] = '>= '   df1.iloc[-1, 0].split('-')[0]

df1

    bins    Good
0   100-600 2
1   650-700 2
2   770-800 2
3   >= 890  2

I'll try to find easier way