pandas cut preserving nans when the binning boundaries are not found in the group by function

Time:12-08

I am getting strange behaviour from the pandas cut function. Suppose I have this dataframe:

import pandas as pd

df = pd.DataFrame([1, 4, 8, 9], columns=['A'])

and I want to bin these values using the following edges:

bins = list(range(0, 10))

Normally, I would expect this:

df['binned'] = pd.cut(df['A'], bins=bins)

print(df)

A  binned
1  (0, 1]
4  (3, 4]
8  (7, 8]
9  (8, 9]

So far, all good. But when I group by the binned column, extra NaN rows suddenly appear.

df = df.groupby('binned', as_index=False).max()
print(df)

binned    A
(0, 1]  1.0
(1, 2]  NaN
(2, 3]  NaN
(3, 4]  4.0
(4, 5]  NaN
(5, 6]  NaN
(6, 7]  NaN
(7, 8]  8.0
(8, 9]  9.0

Why were those empty bins preserved as NaN rows? If they were there from the beginning, why do they show up only after the groupby and not before?

If this is expected behaviour, how can I remove those NaN rows before calling groupby?

I even tried dropna before the groupby, but that doesn't help: there are no NaN rows at that point, so it does nothing.

CodePudding user response:

You need to set observed=True, because your 'binned' column holds categorical values. With observed=False (the default in the documentation quoted below), groupby keeps every category, including the bins that contain no rows.

df.groupby('binned', as_index=False, observed=True).max()
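
To see the effect of the flag side by side, here is a minimal sketch that rebuilds the dataframe from the question and groups it both ways. The names all_bins and seen_bins are only illustrative, and observed is passed explicitly in both calls rather than relying on the default of the installed pandas version.

```python
import pandas as pd

df = pd.DataFrame([1, 4, 8, 9], columns=['A'])
df['binned'] = pd.cut(df['A'], bins=list(range(0, 10)))

# observed=False: one group per category, including bins with no rows.
all_bins = df.groupby('binned', observed=False)['A'].max()

# observed=True: only bins that actually contain at least one row.
seen_bins = df.groupby('binned', observed=True)['A'].max()

print(len(all_bins), len(seen_bins))
```

The empty bins in all_bins come out as NaN because there is no value to take the maximum of.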

As you can see when you check df['binned'].dtype, the type is:

CategoricalDtype(categories=[(0, 1], (1, 2], (2, 3], (3, 4], (4, 5], (5, 6], (6, 7], (7, 8], (8, 9]], ordered=True)

So this is where the information is preserved: not in the values, but in the dtype of the column. That is also why dropna has no effect, since no row actually holds a NaN.
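
The difference between what the dtype stores and what the rows contain can be checked directly. This is a small sketch on the same data as the question; all_categories and observed_values are illustrative names only.

```python
import pandas as pd

df = pd.DataFrame([1, 4, 8, 9], columns=['A'])
df['binned'] = pd.cut(df['A'], bins=list(range(0, 10)))

# Every interval produced by the bin edges is stored in the dtype,
# whether or not any row falls into it.
all_categories = df['binned'].cat.categories

# Only the intervals that actually occur in the column's values.
observed_values = df['binned'].unique()

print(len(all_categories))   # categories defined by the bins
print(len(observed_values))  # categories that appear in the rows
print(df['binned'].isna().sum())  # NaN rows present before the groupby
```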

From the documentation on groupby:

observed : bool, default False

This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
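
If you would rather drop the empty bins before grouping, as the question asks, one possible alternative (not part of the answer above) is to remove the unused categories from the column first with the .cat.remove_unused_categories() accessor. A sketch, with result as an illustrative name:

```python
import pandas as pd

df = pd.DataFrame([1, 4, 8, 9], columns=['A'])
df['binned'] = pd.cut(df['A'], bins=list(range(0, 10)))

# Drop the categories that no row uses; the row values are unchanged.
df['binned'] = df['binned'].cat.remove_unused_categories()

# observed=False is passed on purpose: even when groupby is asked to show
# all categories, only the bins that remain in the dtype are listed.
result = df.groupby('binned', observed=False)['A'].max()
print(result)
```

This changes the column itself, so it only makes sense when the empty bins are not needed later, for example for plotting a full histogram.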
