How does binning work in pandas dataframe and how can I classify my dataset based on percentiles in


I have a pandas dataframe df which looks as follows. df.to_dict() is given at the end of the question.

Values
0   0.010545
1   0.018079
2   0.019491
3   0.042556
4   0.062404
5   0.077826
6   0.080170
7   0.085732
8   0.097538
9   0.104020
10  0.116825
11  0.121143
12  0.147592
13  0.147939
14  0.154998
15  0.157179
16  0.185593
17  0.200474

Since I have 18 values, I want to classify them into 3 different bins as "Low", "Medium" and "High". I also want to know what is the threshold for each of these bins.

I did

df.apply(lambda x:pd.cut(x, bins = 3, labels=['low','medium','high']), axis = 0)

and got the following.

Values
0   low
1   low
2   low
3   low
4   low
5   medium
6   medium
7   medium
8   medium
9   medium
10  medium
11  medium
12  high
13  high
14  high
15  high
16  high
17  high

I got 5 values for low, 7 for medium, and 6 for high. I am curious why I did not get 6 values for each of low, medium and high, since I have 18 values, which is divisible by 3.

I also tried to calculate the 33rd and 67th percentile. df.quantile(0.33) gave me 0.079256 and df.quantile(0.67) gave me 0.131458.
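
For reference, both percentiles can be computed in one call; a minimal sketch assuming the single-column dataframe above:

df['Values'].quantile([0.33, 0.67])
# 0.33    0.079256
# 0.67    0.131458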

I realized that these are not the exact low-medium and medium-high thresholds used by the function I applied, because the value at index 5, which is 0.077826, is lower than the 33rd percentile (0.079256) but is categorized as "medium".

Is my way of calculating percentiles correct? Does binning apply thresholds in the same manner? How do I get the low-medium and medium-high thresholds for the pd.cut(bins = 3, ...) call I applied above?

Is it possible to classify the dataset into equal numbers of low, medium and high values based on [0, 0.33, 0.67, 1] as percentiles? Initially I thought my function did this, but that does not appear to be the case.

df.to_dict() is as shown:

{'Values': {0: 0.0105451195503243,
  1: 0.01807949818715662,
  2: 0.01949062427056047,
  3: 0.04255627128922379,
  4: 0.06240376897660298,
  5: 0.07782590379116708,
  6: 0.0801695217422988,
  7: 0.0857317068170362,
  8: 0.0975380806573516,
  9: 0.1040201601240209,
  10: 0.1168250536954563,
  11: 0.1211426350058809,
  12: 0.1475922708843568,
  13: 0.1479393893305906,
  14: 0.1549975110559438,
  15: 0.157178926862648,
  16: 0.1855926516856752,
  17: 0.2004743800065415}}

CodePudding user response:

Quoting from pd.cut's documentation, under the description of the bins parameter:

int : Defines the number of equal-width bins in the range of x. [...]

This means that the range of your series (i.e. the maximum value minus the minimum value) is divided into 3 segments of equal width (about 0.06331 in your case), so you are not guaranteed to get bins containing the same number of values.

For example

df = pd.DataFrame([1, 2, 3, 4, 5, 100])
df.apply(lambda x:pd.cut(x, bins = 3, labels=['low','medium','high']), axis = 0)

will return

0   low
1   low
2   low
3   low
4   low
5   high

because the interior bin edges are at 34 and 67 (pandas also shifts the lowest edge slightly below the minimum, so that the minimum value itself is included).
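
If you want to see the exact thresholds pd.cut chose, one option (a sketch, not part of the original answer) is to pass retbins=True, which returns the bin edges alongside the categories:

import pandas as pd

# assuming df is the asker's single-column dataframe from the question
binned, edges = pd.cut(df['Values'], bins=3,
                       labels=['low', 'medium', 'high'],
                       retbins=True)
print(edges)
# roughly [0.010355, 0.073855, 0.137165, 0.200474]
# the two interior edges are the low-medium and medium-high thresholds,
# which is why 0.077826 (index 5) ends up in "medium"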


To answer your question, one thing you can do is sort the array and slice it into 3 parts. An example of code achieving this is the following:

# Sort array and keep track of old indexes to get back the original sorting at the end
df = df.sort_values('Values').reset_index()

# Split df into 3 chunks
chunk_size = len(df) // 3
for i, bin_label in zip(range(0, len(df), chunk_size), range(3)):
    df.loc[i: i + chunk_size, 'bin'] = bin_label
    
# Assign any remaining rows (when the length is not divisible by 3) to the last bin
df['bin'] = df['bin'].ffill()

# Go back to the original sorting
df = df.set_index('index').sort_index()

The code above handles the case where the array is not necessarily sorted. If your array is already sorted, as in your example, then you can just do

# Split df into 3 chunks
chunk_size = len(df) // 3
for i, bin_label in zip(range(0, len(df), chunk_size), range(3)):
    df.loc[i: i + chunk_size, 'bin'] = bin_label

# Assign any remaining rows (when the length is not divisible by 3) to the last bin
df['bin'] = df['bin'].ffill()

After that you can rename the bins as you want, or you can replace range(3) in the for loop with a list such as ['low', 'medium', 'high'].
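
A more compact variant (a sketch, not part of the original answer) is to qcut the rank of the values; rank(method='first') gives every row a unique rank, so the quantile bins come out exactly equal in size and the original row order never has to be touched:

import pandas as pd

# df is the asker's single-column dataframe; rank ties are broken by position
order = df['Values'].rank(method='first')

# 18 unique ranks cut into 3 quantile bins -> exactly 6 rows per label
df['bin'] = pd.qcut(order, q=3, labels=['low', 'medium', 'high'])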

CodePudding user response:

I figured out that df.apply(lambda x:pd.qcut(x, q = [0, 0.33, 0.67, 1], labels=['low','medium','high']), axis = 0) gives me the classification based on the 0 to 33rd percentile, the 33rd to 67th percentile, and the 67th percentile to the maximum.
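
To also see the thresholds that qcut uses (a sketch along the same lines, assuming the dataframe from the question), retbins=True returns the quantile edges:

binned, edges = pd.qcut(df['Values'], q=[0, 0.33, 0.67, 1],
                        labels=['low', 'medium', 'high'],
                        retbins=True)
print(edges)
# roughly [0.010545, 0.079256, 0.131458, 0.200474]
print(binned.value_counts())
# low, medium and high each contain 6 of the 18 values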

But this fails when the bin edges are not unique. For example, if I have [1, 1, 1, 1, 1, 2, 3, 4], the 33rd percentile is 1 and the 67th percentile is 1.69, so the data cannot be divided into equal groups using the 33rd and 67th percentiles as thresholds.
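
If you just need qcut not to raise on repeated edges, one workaround (a sketch, not something shown in the question or answers) is duplicates='drop', at the cost of ending up with fewer bins:

import pandas as pd

s = pd.Series([1, 1, 1, 1, 1, 2, 3, 4])

# the repeated edge at 1 is collapsed, leaving only two bins,
# so a fixed three-label list can no longer be passed here
binned, edges = pd.qcut(s, q=[0, 0.33, 0.67, 1],
                        duplicates='drop', retbins=True)
print(edges)                  # roughly [1.0, 1.69, 4.0]
print(binned.value_counts())  # 5 values in the first bin, 3 in the second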

I also figured out what pd.cut(x, bins = n) uses as thresholds. From my understanding, with n = 3 it breaks my data into segments of width range/n. The low-medium threshold is min + range/n, e.g. in [1, 1, 1, 1, 1, 2, 3, 4] the low-medium threshold is 1 + (4-1)/3 = 1 + 1 = 2. But 2 itself is still in low, and anything above 2 is medium.

Similarly, the medium-high threshold is max - range/n. In the data above, it is 4 - 3/3 = 4 - 1 = 3. But 3 itself is still medium, and anything above 3 is categorized as high.
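
A quick check of this reasoning (again a sketch, on the same toy data) with retbins=True:

import pandas as pd

s = pd.Series([1, 1, 1, 1, 1, 2, 3, 4])
binned, edges = pd.cut(s, bins=3, retbins=True)
print(edges)                            # roughly [0.997, 2.0, 3.0, 4.0]
print(binned.value_counts(sort=False))
# (0.997, 2.0]    6   <- the five 1s plus the 2 (right edge is inclusive)
# (2.0, 3.0]      1   <- the 3
# (3.0, 4.0]      1   <- the 4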
