I am dramatically simplifying this question (compared to the initial draft).
Given a dataframe with a categorical column, if I subset the dataframe so it only contains a few of the categorical values, the subsetted dataframe still remembers the full list of category levels.
When I groupby on this smaller dataframe, I continue to get values which I don't expect.
Here is some code which replicates this:
import pandas as pd
df = pd.DataFrame({
'colA':['a', 'b', 'b', 'c', 'c', 'c'],
'colB':1
})
df['colA'] = df['colA'].astype('category')  # <= this is important
df
output
colA colB
0 a 1
1 b 1
2 b 1
3 c 1
4 c 1
5 c 1
df.groupby('colA').max()
output
colB
colA
a 1
b 1
c 1
smalldf = df.iloc[:3]
smalldf['colA']
output
0 a
1 b
2 b
Name: colA, dtype: category
Categories (3, object): ['a', 'b', 'c']
Notice above that 'c' is still in the category levels, even though it no longer appears in the data.
smalldf.groupby('colA').max()
output
colB
colA
a 1.0
b 1.0
c NaN
This 'c' row is unexpected. However, I suppose it can be explained away, since subsetting doesn't remove the category levels, and groupby emits a row for every level.
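If the leftover level is unwanted, it can be dropped explicitly with pandas' Series.cat.remove_unused_categories(). A minimal sketch (smalldf2 is just an illustrative name):
# remove_unused_categories() returns a new Series rather than mutating,
# and assign() builds a new frame, so smalldf itself is untouched
smalldf2 = smalldf.assign(colA=smalldf['colA'].cat.remove_unused_categories())
smalldf2.groupby('colA').max()
output
colB
colA
a 1
b 1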
The problem gets worse when you write out the smaller dataframe:
from pathlib import Path

FEATHER_PATH = Path.home() / 'Downloads' / 'test.feather'
smalldf.to_feather(FEATHER_PATH)
pd.read_feather(FEATHER_PATH)['colA']
output
0 a
1 b
2 b
Name: colA, dtype: category
Categories (3, object): ['a', 'b', 'c']
Say you write out the smaller dataframe and pass it to someone. They take a look at the file and only see colA values of 'a' and 'b'. However, when they do a groupby, 'c' shows up, even though 'c' is nowhere to be found in the actual data! They would have to explicitly look for category levels to find the 'c' hidden in there!
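For what it's worth, a recipient could surface the hidden level by comparing the declared categories against the values actually present, with a sketch like:
col = pd.read_feather(FEATHER_PATH)['colA']
# .cat.categories lists every level the dtype declares;
# .unique() only reports values that actually occur in the column
set(col.cat.categories) - set(col.unique())
output
{'c'}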
I'm not sure if this behavior is wrong, but it can certainly be extremely confusing! Perhaps this can be left here as a warning to others.
CodePudding user response:
It's not wrong, but it is a bit hidden in the docs. You can find it for example here:
DataFrame methods like DataFrame.sum() also show “unused” categories.
Groupby will also show “unused” categories
Regarding the writing/reading: without preserving the categories you would lose type information, which might hurt as well. So there's a tradeoff to balance.
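If you want the file to keep the categorical dtype but carry only the levels that actually occur, one option (a sketch, not the only way; out is just an illustrative name) is to drop unused categories on a copy before writing:
out = smalldf.copy()
# keep the categorical dtype, but only the observed levels
out['colA'] = out['colA'].cat.remove_unused_categories()
out.to_feather(FEATHER_PATH)
pd.read_feather(FEATHER_PATH)['colA'].cat.categories
output
Index(['a', 'b'], dtype='object')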
As for the groupby itself, you should be able to handle it by using the observed=True option of .groupby():
observed: bool, default False
This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
For example:
smalldf.groupby('colA', observed=True).max()
delivers
colB
colA
a 1
b 1
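One caveat worth adding (based on later pandas releases, so check against your version): pandas 2.1 deprecated the observed=False default for categorical groupers, and the default is slated to flip to True, so passing it explicitly also keeps behavior stable across versions:
# On pandas >= 2.1 omitting `observed` for a categorical grouper emits a
# FutureWarning; spelling it out avoids surprises on upgrade
smalldf.groupby('colA', observed=True).max()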
CodePudding user response:
This is correct: the result of the groupby will have the same length as the list of unique category levels of the 'colA' column, not the list of values actually present in the subset.
Edit: that assumes 'colA' is the column you group on.
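To illustrate with a quick check (g is just an illustrative name; observed=False spelled out to pin the default behavior):
g = smalldf.groupby('colA', observed=False)
# one row per declared category level, whether or not the level occurs
len(g.max()) == len(smalldf['colA'].cat.categories)
output
True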