pandas cut - different bins for different labels

Time:08-04

I have a data frame with two different labels, A and B, and an associated numeric value. I want to add a column giving the label of the custom bin that the numeric value falls into, which can be achieved with pd.cut() as follows:

import numpy as np
import pandas as pd

df = pd.DataFrame({"label": ['A','A','A','A','A','A','B','B','B','B'],
                   "num":   [ 1 , 2 , 4 , 5 , 10, 11, 1 , 3 , 4 , 5 ]})

# Same bin edges for every row, regardless of label
df['Bin'] = pd.cut(df["num"],
                   [0, 4.5, 7.5, np.inf],
                   labels=['0-4', '5-8', '>8'],
                   include_lowest=True)

giving:

  label  num  Bin
0     A    1  0-4
1     A    2  0-4
2     A    4  0-4
3     A    5  5-8
4     A   10   >8
5     A   11   >8
6     B    1  0-4
7     B    3  0-4
8     B    4  0-4
9     B    5  5-8

This works well for A, but most of the B values fall into the bottom bin, so I'd like to increase the resolution by using different bins for A and B, producing the following:

  label  num  Bin
0     A    1  0-4
1     A    2  0-4
2     A    4  0-4
3     A    5  5-8
4     A   10   >8
5     A   11   >8
6     B    1  0-2
7     B    3  2-4
8     B    4  2-4
9     B    5   >4

It feels like this should be possible using a conditional such as df.where(), or maybe a groupby with transform() or apply(), or a list comprehension with an if, but I have been reading Stack Overflow and experimenting all day and have not managed to get anything working.

I guess I could split the data into individual data frames based on label, apply a custom cut to each sub-dataframe, and then concatenate the results back together, but this doesn't feel very pythonic, nor does it lend itself to generalisable code.
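For reference, the split, cut and concatenate idea could be sketched roughly as below. The names bins_by_label, names_by_label and cut_per_label are made up for this illustration, and the edges used for B are only an assumption chosen to reproduce the desired output above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"label": ['A','A','A','A','A','A','B','B','B','B'],
                   "num":   [1, 2, 4, 5, 10, 11, 1, 3, 4, 5]})

# Hypothetical per-label bin edges and bin names (the B edges are an assumption)
bins_by_label = {'A': [0, 4.5, 7.5, np.inf], 'B': [0, 2.5, 4.5, np.inf]}
names_by_label = {'A': ['0-4', '5-8', '>8'], 'B': ['0-2', '2-4', '>4']}

def cut_per_label(frame):
    pieces = []
    for key, sub in frame.groupby('label'):
        binned = pd.cut(sub['num'], bins=bins_by_label[key],
                        labels=names_by_label[key], include_lowest=True)
        # Cast to plain objects so pieces with different categories concatenate cleanly
        pieces.append(binned.astype(object))
    # Stitch the pieces together and restore the original row order
    return pd.concat(pieces).reindex(frame.index)

result = df.assign(Bin=cut_per_label(df))
print(result)
```

This keeps everything in one data frame, but the explicit loop and concat are the part that feels clumsy.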

PS - This is a minimal example; my real data frame has more label values, and I want to keep it as a single data frame with differing bins for further processing in my code, hence not splitting it into separate data frames based on label.

CodePudding user response:

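Yes, groupby().apply() is a good choice. For example, to let pd.cut choose three equal-width bins within each label, you can do:

# group_keys=False keeps the original row index so the result aligns with df
df['Bin'] = df.groupby('label', group_keys=False)['num'].apply(pd.cut, bins=3)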

Output:

  label  num             Bin
0     A    1   (0.99, 4.333]
1     A    2   (0.99, 4.333]
2     A    4   (0.99, 4.333]
3     A    5  (4.333, 7.667]
4     A   10   (7.667, 11.0]
5     A   11   (7.667, 11.0]
6     B    1  (0.996, 2.333]
7     B    3  (2.333, 3.667]
8     B    4    (3.667, 5.0]
9     B    5    (3.667, 5.0]
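
To see where those intervals come from: with an integer bins argument, pd.cut splits the range between the minimum and maximum of the values it is given into that many equal-width intervals, so each label ends up with its own edges. A small sketch for the A values only (the variable names are just for illustration):

```python
import pandas as pd

a_values = pd.Series([1, 2, 4, 5, 10, 11])  # the 'num' values for label A

# retbins=True returns the computed bin edges alongside the binned values
binned, edges = pd.cut(a_values, bins=3, retbins=True)
print(edges)
```

The lowest edge is nudged slightly below the minimum so that the smallest value is included in the first interval.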

Or, if you have a specific bins/labels mapping for each label, you can do this:

bins = {'A': [0, 4.5, 7.5, np.inf], 'B': [0, 2.5, 4.5, np.inf]}
labels = {'A': ['0-4', '5-8', '>8'], 'B': ['0-2', '2-4', '>4']}

def my_cut(data, bins, labels):
    # Each group of 'num' values arrives with its group key ('A' or 'B') as .name
    key = data.name
    return pd.cut(data, bins=bins[key], labels=labels[key])

# Selecting 'num' avoids relying on the grouping column being passed to the function
df['Bin'] = df.groupby('label', group_keys=False)['num'].apply(my_cut, bins=bins, labels=labels)

Output:

  label  num  Bin
0     A    1  0-4
1     A    2  0-4
2     A    4  0-4
3     A    5  5-8
4     A   10   >8
5     A   11   >8
6     B    1  0-2
7     B    3  2-4
8     B    4  2-4
9     B    5   >4
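
As a quick sanity check, you could count how many rows land in each bin per label. The sketch below rebuilds the output table above by hand (the names result and counts are only for illustration) and cross-tabulates it:

```python
import pandas as pd

# The label and Bin columns copied from the output above
result = pd.DataFrame({
    "label": ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    "Bin":   ['0-4', '0-4', '0-4', '5-8', '>8', '>8', '0-2', '2-4', '2-4', '>4'],
})

# Rows are labels, columns are bin names, cells are row counts
counts = pd.crosstab(result['label'], result['Bin'])
print(counts)
```

The B rows are now spread over three bins instead of mostly sitting in the bottom one.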