Pandas: How to display inclusive starting and ending bin labels

import numpy as np
bins = np.arange(0, 189, 6)
bins

returns

array([  0,   6,  12,  18,  24,  30,  36,  42,  48,  54,  60,  66,  72,
        78,  84,  90,  96, 102, 108, 114, 120, 126, 132, 138, 144, 150,
       156, 162, 168, 174, 180, 186])

which I then use to categorize a column of differences:

df['diffs'] = pd.cut(df['differences'], bins=bins)
df['diffs'].value_counts()

resulting in this:

(0, 6]        1744
(6, 12]       1199
(12, 18]      1003
(18, 24]       934
(24, 30]       815
(30, 36]       754
etc

However, I want the ranges to be like so: [0, 6], [7, 13], [14, 20] and so on where both points of each bin are inclusive and the next bin adds 1 to the max of the previous bin.

CodePudding user response：

Custom labels can be added to binned data by passing the labels argument to pd.cut(). See the pandas documentation for pd.cut for details.

In the example below, the labels are built with a list comprehension that zips the bin edges against the same edges offset by one, so each left edge is paired with the next edge.

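Additionally, I've increased the bin width from 6 to 7 so that each bin covers seven integers (0 through 6, 7 through 13, and so on), and added right=False so that each bin is closed on the left and open on the right. For integer data, the half-open bin [0, 7) holds exactly the same values as the inclusive range [0, 6], which is what the labels show. This equivalence only holds for integers: a value such as 6.5 would also land in the bin labeled [0,6].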

import pandas as pd
import random

# Create a random set of testing values.
random.seed(73)
vals = [random.randint(1,188) for _ in range(1000)]
# Create a testing DataFrame.
df = pd.DataFrame({'vals': vals})

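# Create bins and bin labels.
# Note: the last edge produced here is 182, so values 182-188 fall outside
# every bin and become NaN; range(0, 190, 7) would add a final [182,188] bin.
bins = range(0, 189, 7)
labels = [f'[{a},{b-1}]' for a, b in zip(bins, bins[1:])]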

# Apply the bins and labels.
pd.cut(df['vals'], 
       bins=bins, 
       right=False, 
       labels=labels).value_counts().sort_index()

Example output:

[0,6]        32
[7,13]       39
[14,20]      28
...          ...
[161,167]    45
[168,174]    32
[175,181]    33
Name: vals, dtype: int64
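
As a quick sanity check of the boundary behavior, here is a minimal sketch that bins a few hand-picked integers sitting on either side of the bin edges. The test values and the stop of 190 are my own assumptions for illustration (the stop of 190 adds a final edge at 189 so that 182 through 188 get a bin as well); everything else mirrors the approach above.

```python
import pandas as pd

# Bin edges every 7 units. The stop of 190 is an assumption made for this
# sketch: it adds a final edge at 189 so that 182..188 also get a bin.
bins = list(range(0, 190, 7))
labels = [f'[{a},{b-1}]' for a, b in zip(bins, bins[1:])]

# Hand-picked integers on either side of bin boundaries (illustrative only).
edge_vals = pd.Series([0, 6, 7, 13, 14, 188])

# right=False makes each bin left-closed, right-open: [a, b).
binned = pd.cut(edge_vals, bins=bins, right=False, labels=labels)

print(binned.tolist())
# ['[0,6]', '[0,6]', '[7,13]', '[7,13]', '[14,20]', '[182,188]']
```

Both 6 and 7 land where the labels say they should, which is the behavior asked for in the question.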

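Unlabeled output, for accuracy comparison. Produced by the same pd.cut(df['vals'], ...) call on the random test data, with the labels argument omitted. The counts match the labeled output, confirming that each label describes the correct bin.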

[0, 7)        32
[7, 14)       39
[14, 21)      28
...           ...
[161, 168)    45
[168, 175)    32
[175, 182)    33
Name: vals, dtype: int64
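
An alternative worth considering, sketched here under my own assumptions rather than taken from the answer above: pd.cut() also accepts a pd.IntervalIndex as bins, and pd.IntervalIndex.from_arrays() can build intervals that are closed on both sides. The bins then really are [0, 6], [7, 13], and so on, so no custom labels are needed. The variable names and test values below are illustrative.

```python
import numpy as np
import pandas as pd

# Left and right endpoints of each inclusive bin.
left = np.arange(0, 189, 7)   # 0, 7, ..., 182
right = left + 6              # 6, 13, ..., 188

# Intervals closed on both sides; they do not overlap because each bin
# starts one past the end of the previous bin.
intervals = pd.IntervalIndex.from_arrays(left, right, closed='both')

# Illustrative integer values on either side of the bin boundaries.
vals = pd.Series([0, 6, 7, 13, 14, 188])
binned = pd.cut(vals, bins=intervals)

# Position of the matching interval for each value.
codes = binned.cat.codes.tolist()
print(codes)
# [0, 0, 1, 1, 2, 26]
```

One design difference from the right=False approach: there are gaps between these intervals, so a non-integer value such as 6.5 is not matched by any bin, whereas the half-open bins place it in [0, 7).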