bins = np.arange(0, 189, 6)
bins
returns
array([ 0, 6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66, 72,
78, 84, 90, 96, 102, 108, 114, 120, 126, 132, 138, 144, 150,
156, 162, 168, 174, 180, 186])
which I then use to categorize a column of differences:
df['diffs'] = pd.cut(df['differences'], bins=bins)
df['diffs'].value_counts()
resulting in this:
(0, 6] 1744
(6, 12] 1199
(12, 18] 1003
(18, 24] 934
(24, 30] 815
(30, 36] 754
etc
However, I want the ranges to be like so: [0, 6], [7, 13], [14, 20] and so on, where both endpoints of each bin are inclusive and each bin starts 1 above the maximum of the previous bin.
CodePudding user response:
Custom labels can be added to a binned column by passing the labels argument to pd.cut(). See the pandas documentation for pd.cut() for details.
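In the example below, the labels are built with a list comprehension and zip, which pairs each bin edge with the next one so that each label can show the next edge minus 1 as its inclusive upper bound.
Additionally, I've increased the bin size to 7 (from 6) and added the right=False argument, which makes each bin closed on the left and open on the right. For integer data, [0, 7) holds the same values as [0, 6].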
import pandas as pd
import random
# Create a random set of testing values.
random.seed(73)
vals = [random.randint(1,188) for _ in range(1000)]
# Create a testing DataFrame.
df = pd.DataFrame({'vals': vals})
# Create bins and bin labels. Note: the last edge is 182, so the test
# values 182-188 fall outside every bin and are returned as NaN.
bins = range(0, 189, 7)
labels = [f'[{a},{b-1}]' for a, b in zip(bins, bins[1:])]
# Apply the bins and labels.
pd.cut(df['vals'],
       bins=bins,
       right=False,
       labels=labels).value_counts().sort_index()
Example output:
[0,6] 32
[7,13] 39
[14,20] 28
... ...
[161,167] 45
[168,174] 32
[175,181] 33
Name: vals, dtype: int64
Unlabeled output, for accuracy comparison, produced by the same pd.cut(...) statement without the labels argument:
[0, 7) 32
[7, 14) 39
[14, 21) 28
... ...
[161, 168) 45
[168, 175) 32
[175, 182) 33
Name: vals, dtype: int64
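If you'd rather have the intervals themselves closed on both ends, instead of relabeling half-open bins, one option is to pass a pd.IntervalIndex to pd.cut(). The following is a minimal sketch, assuming integer data. The sample values and the diff_range column name are made up for illustration, and the edges are derived from the largest value so that no value is left unbinned.

```python
import pandas as pd

# Hypothetical sample data; 'differences' mirrors the column name in the question.
df = pd.DataFrame({'differences': [0, 3, 6, 7, 13, 14, 20, 188]})

width = 7  # each bin holds 7 integer values, e.g. 0 through 6

# Left edges 0, 7, 14, ... up to the largest value in the data.
left = list(range(0, int(df['differences'].max()) + 1, width))
# Matching right edges 6, 13, 20, ...
right = [a + width - 1 for a in left]

# closed='both' makes every interval include both of its endpoints.
intervals = pd.IntervalIndex.from_arrays(left, right, closed='both')

# Assumed column name for the result; not part of the original post.
df['diff_range'] = pd.cut(df['differences'], bins=intervals)
counts = df['diff_range'].value_counts().sort_index()
print(counts)
```

Because there are gaps between consecutive intervals, a non-integer value such as 6.5 would match no bin and come back as NaN, so this approach only suits integer-valued data.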