I use an np.where() statement to define a Pandas column. It works perfectly fine outside of a Pandas groupby/apply function, but fails inside of the groupby/apply function.
Here's the original dataframe:
unit of measure | city
--------------- | -------------
NaN             | 'Atlanta'
'SF'            | 'Phoenix'
'Acre'          | 'Los Angeles'
Here's the np.where() statement:
testing['regex_unit'] = np.where(testing['unit of measure'].notna(),
                                 testing['unit of measure'].str.lower(),
                                 testing['city'])
Result (outside of groupby/apply):
unit of measure | city          | regex_unit
--------------- | ------------- | ----------
NaN             | 'Atlanta'     | 'Atlanta'
'SF'            | 'Phoenix'     | 'sf'
'Acre'          | 'Los Angeles' | 'acre'
But when I group by 'city' and run the np.where() inside of an apply function...
def apply_function(df):
    # Make all string columns title case
    for col in df.columns:
        if (df[col].dtype == 'object'):
            df[col] = df[col].apply(lambda x: str(x).title())
    # Replace string "Nans" with NaN
    df = df.replace('Nan', np.nan)
    # Replace 'No Zoning Data Available' with NaN
    df = df.replace('No Zoning Data Available', np.nan)
    # Double check the dataframe and column dtype
    display(df)
    print(df['unit of measure'].iloc[0])
    print(df['unit of measure'].isnull())
    df['regex_unit'] = np.where(df['unit of measure'].notna(),
                                df['unit of measure'].str.lower(),
                                df['city'])
    return df
new_df = testing.groupby(['city'], as_index=False).apply(apply_function)
I get this error...
unit of measure | city
--------------- | ---------
NaN             | 'Atlanta'
nan
0 True
Name: unit of measure, dtype: bool
      2 df['regex_unit'] = np.where(df['unit of measure'].notna(),
----> 3                             df['unit of measure'].str.lower(),
      4                             df['regex_unit_temp'])
AttributeError: Can only use .str accessor with string values!
Why does np.where() behave differently inside a function applied to a groupby dataframe? What am I not seeing or understanding?
EDIT: When I comment out df = df.replace('Nan', np.nan), everything works.
I added print statements just before the np.where() statement to show that the value is indeed null, so np.where() should pick the fallback branch (the third argument: df['city'] in the code above, shown as df['regex_unit_temp'] in the traceback), not the second argument (df['unit of measure'].str.lower()).
What am I not understanding about how df = df.replace('Nan', np.nan) is used in this function?
CodePudding user response:
I have found the line that triggers the error, although I don't understand why it fails, so any thoughts or further explanation would be appreciated.
The problem is with this line being used in the np.where() statement:
df['unit of measure'].str.lower()
The np.where() statement works correctly if changed to this:
df['regex_unit'] = np.where(df['unit of measure'].notna(),
                            True,
                            False)
And works correctly if changed to this:
df['regex_unit'] = np.where(df['unit of measure'].notna(),
                            df['unit of measure'],
                            df['city'])
But the problem occurs when we turn df['unit of measure'] into df['unit of measure'].str.lower().
df['unit of measure'] returns a Series, and df['unit of measure'].str.lower() also returns a Series. I'm not sure why this would cause a problem if we are only applying the str.lower() method to cells that are notna(). Again, any further clarification would be appreciated, but for now my question is answered.
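One likely explanation, offered as a sketch rather than a confirmed diagnosis: np.where() is an ordinary function call, so Python evaluates all three arguments in full before np.where() runs. That means df['unit of measure'].str.lower() is computed for the whole column, not only for the rows where notna() is True. In the Atlanta group every value is NaN after the replace, and on pandas versions where replace() leaves such a column as float64 (an assumption; other releases may keep the object dtype), the .str accessor raises before np.where() is ever called. The snippet below reproduces this with a hand-built float64 column. The frame named group and the list-comprehension workaround are illustrative and not part of the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the all-NaN "Atlanta" group; float64 is forced
# explicitly so the demo does not depend on how replace() infers dtypes.
group = pd.DataFrame({
    "unit of measure": pd.Series([np.nan], dtype="float64"),
    "city": ["Atlanta"],
})

# The second argument is built before np.where() is called, so it fails
# even though no row would ever select it.
try:
    np.where(group["unit of measure"].notna(),
             group["unit of measure"].str.lower(),
             group["city"])
    accessor_failed = False
except AttributeError:
    accessor_failed = True

# One possible workaround: decide row by row, so .lower() is only called
# on values that really are strings.
group["regex_unit"] = [
    unit.lower() if isinstance(unit, str) else city
    for unit, city in zip(group["unit of measure"], group["city"])
]

# The same expression on a mixed frame shaped like the original data.
testing = pd.DataFrame({
    "unit of measure": [np.nan, "SF", "Acre"],
    "city": ["Atlanta", "Phoenix", "Los Angeles"],
})
testing["regex_unit"] = [
    unit.lower() if isinstance(unit, str) else city
    for unit, city in zip(testing["unit of measure"], testing["city"])
]
```

Whether the replace() call actually changes the column's dtype depends on the pandas version, so printing df['unit of measure'].dtype inside apply_function is a quick way to check this for a given setup.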