np.where not working as expected in pandas groupby/apply function

I use an np.where() statement to define a Pandas column. It works perfectly fine outside of a Pandas groupby/apply function, but seems to fail inside of the groupby/apply function.

Here's the original dataframe:

unit of measure | city
---------------   ---------
NaN             | 'Atlanta'
'SF'            | 'Phoenix'
'Acre'          | 'Los Angeles'

Here's the np.where() statement:

testing['regex_unit'] = np.where(testing['unit of measure'].notna(),
                                testing['unit of measure'].str.lower(),
                                testing['city'])

Result (outside of groupby/apply):

unit of measure | city          | regex_unit
---------------   -------------   ----------
NaN             | 'Atlanta'     | 'Atlanta'
'SF'            | 'Phoenix'     | 'sf'
'Acre'          | 'Los Angeles' | 'acre'

But when I group by 'city' and run the np.where() inside of an apply function...

def apply_function(df):
    
    # Make all string columns title case
    for col in df.columns:
        if (df[col].dtype == 'object'):
            df[col] = df[col].apply(lambda x: str(x).title())

    # Replace string "Nans" with NaN
    df = df.replace('Nan', np.nan)
        
    # Replace 'No Zoning Data Available' with NaN
    df = df.replace('No Zoning Data Available', np.nan)

    # Double check the dataframe and column dtype
    display(df)
    print(df['unit of measure'].iloc[0])
    print(df['unit of measure'].isnull())

    df['regex_unit'] = np.where(df['unit of measure'].notna(),
                                df['unit of measure'].str.lower(),
                                df['city'])
    return df

new_df = testing.groupby(['city'], as_index=False).apply(apply_function)

I get this error...

unit of measure | city
---------------   ---------
NaN             | 'Atlanta'

nan
0    True
Name: unit of measure, dtype: bool

     2     df['regex_unit'] = np.where(df['unit of measure'].notna(),
---> 3                                 df['unit of measure'].str.lower(),
     4                                 df['regex_unit_temp'])

AttributeError: Can only use .str accessor with string values!

Why does np.where() behave differently inside a function applied to a groupby dataframe? What am I not seeing or understanding?

EDIT: When I comment out df = df.replace('Nan', np.nan), everything works.

I added print statements just before the np.where() statement to show that the value is indeed null, so np.where() should select the second branch (df['regex_unit_temp'] in the traceback, df['city'] in the code above), not the first (df['unit of measure'].str.lower()).

What am I not understanding about how df = df.replace('Nan', np.nan) is used in this function?
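
The prints above confirm that the value is null, but not which dtype the column has at that point. Here is a minimal diagnostic sketch, assuming a one-row group like the 'Atlanta' one. The `group` frame and the `dtype_before` / `dtype_after` names are made up for illustration. Whether replace() changes the dtype of an all-NaN column depends on the pandas version, so the result is printed rather than assumed:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the single-row 'Atlanta' group.
group = pd.DataFrame({"unit of measure": [np.nan], "city": ["Atlanta"]},
                     dtype=object)

# Same title-casing step as in apply_function: NaN becomes the string 'Nan'.
group["unit of measure"] = group["unit of measure"].apply(lambda x: str(x).title())
dtype_before = group["unit of measure"].dtype

# Same replacement step: the string 'Nan' goes back to a real NaN.
group = group.replace("Nan", np.nan)
dtype_after = group["unit of measure"].dtype

# If dtype_after is a float dtype, the .str accessor is no longer available.
print("before replace:", dtype_before)
print("after replace: ", dtype_after)
```

If the second line reports float64, the column is no longer string-like, even though the value still prints as nan.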

User response:

I have found the line that triggers the error, although I don't understand why it fails, so any thoughts or further explanation would be appreciated.

The problem is with this line being used in the np.where() statement:

df['unit of measure'].str.lower()

The np.where() statement works correctly if changed to this:

df['regex_unit'] = np.where(df['unit of measure'].notna(),
                            True,
                            False)

And works correctly if changed to this:

df['regex_unit'] = np.where(df['unit of measure'].notna(),
                            df['unit of measure'],
                            df['city'])

But the problem occurs when we turn df['unit of measure'] into df['unit of measure'].str.lower().

df['unit of measure'] returns a Series, and df['unit of measure'].str.lower() also returns a Series, so I'm not sure why this causes a problem when I only expect str.lower() to be applied to cells that are notna(). Again, any further clarification would be appreciated, but for now my question is answered.
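
A likely explanation, offered as a sketch rather than a confirmed diagnosis: np.where() is an ordinary function call, so Python evaluates all three arguments on the whole column before any selection happens. df['unit of measure'].str.lower() therefore runs on every row, including the null ones. In a group whose column holds only NaN and has ended up with a float dtype (which replace() may cause in some pandas versions), the .str accessor itself raises. The example below assumes such a float64 column and shows a row-wise alternative. `lower_or_city` is a hypothetical helper, not part of pandas or NumPy.

```python
import numpy as np
import pandas as pd

# Assumed state of the one-row 'Atlanta' group: the all-NaN column is float64.
df = pd.DataFrame({
    "unit of measure": pd.Series([np.nan], dtype="float64"),
    "city": pd.Series(["Atlanta"], dtype=object),
})

# np.where cannot short-circuit: the .str.lower() argument is evaluated
# first, and .str refuses to work on a float column.
try:
    np.where(df["unit of measure"].notna(),
             df["unit of measure"].str.lower(),
             df["city"])
    eager_error = None
except AttributeError as exc:
    eager_error = exc

print(type(eager_error).__name__)  # AttributeError


def lower_or_city(unit, city):
    """Hypothetical helper: lowercase the unit if it is a string, else use city."""
    return unit.lower() if isinstance(unit, str) else city


# Row-wise selection only calls .lower() on actual strings.
df["regex_unit"] = [lower_or_city(u, c)
                    for u, c in zip(df["unit of measure"], df["city"])]
print(df["regex_unit"].tolist())  # ['Atlanta']
```

The row-wise version is slower than a vectorized call on large frames, but it matches the intent of "only lowercase the cells that are not null" and does not depend on the column's dtype.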