pandas raises an error when I try to get the max of a string column that contains np.nan, since np.nan is a float and can't be compared to str.
Any suggestions on how to handle this?
import numpy as np
import pandas as pd
df = pd.DataFrame({'letters': ['a', 'b', np.nan]})
df
# letters
# 0 a
# 1 b
# 2 NaN
for e in df['letters']:
    print(e, type(e))
# a <class 'str'>
# b <class 'str'>
# nan <class 'float'>
df['letters'].max()
gives the error:
TypeError: '>=' not supported between instances of 'str' and 'float'
----update-----
dropna works for a simple sort/max, but not when there's a groupby, since it deletes any group that contains only NaN. E.g.:
df = pd.DataFrame({'letters': ['a', 'b', np.nan, np.nan],
                   'grp': [1, 1, 1, 2]})
df
# letters grp
# 0 a 1
# 1 b 1
# 2 NaN 1
# 3 NaN 2
df.groupby('grp')['letters'].max()
# dropna will delete grp == 2
CodePudding user response:
The issue is that you are using the default float NaN.
This works fine with the new pd.NA type, which you can obtain with convert_dtypes, so that the column has the string dtype instead of object:
df = df.convert_dtypes()
df['letters'].max()
# 'b'
df['letters'].max(skipna=False)
# <NA>
df after convert_dtypes:
letters
0 a
1 b
2 <NA>
dtypes:
df.dtypes
letters string
dtype: object
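For the grouped case from the question's update, the same conversion can be tried before the groupby. This is a minimal sketch and not something tested in the answer above: it assumes that groupby max on the nullable string dtype skips pd.NA the same way Series.max does, which may depend on the pandas version. The names converted and result are only for illustration.

```python
import numpy as np
import pandas as pd

# Grouped example from the question's update; group 2 holds only missing values.
df = pd.DataFrame({'letters': ['a', 'b', np.nan, np.nan],
                   'grp': [1, 1, 1, 2]})

# convert_dtypes switches 'letters' to the nullable string dtype,
# so missing values become pd.NA instead of float NaN.
converted = df.convert_dtypes()

# Assumption: groupby max skips pd.NA like Series.max does,
# and an all-missing group yields a missing value rather than being dropped.
result = converted.groupby('grp')['letters'].max()
print(result)
```

If this behaves as assumed, both groups stay in the result, and group 2 simply has a missing value as its max.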
CodePudding user response:
Perhaps you could drop them first:
out = df['letters'].dropna().max()
If you need to find the max of multiple columns, then you could stack them and then use groupby + max:
out = df.stack().groupby(level=1).max()
Output:
'b'
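For the groupby case in the question's update, one possible variation on the dropna idea is to drop NaN inside each group instead of on the whole frame, so that a group containing only NaN is kept. This is a sketch rather than part of the answer above; the lambda and the name result are illustrative.

```python
import numpy as np
import pandas as pd

# Grouped example from the question's update; group 2 holds only NaN.
df = pd.DataFrame({'letters': ['a', 'b', np.nan, np.nan],
                   'grp': [1, 1, 1, 2]})

# Drop NaN per group, then take the max of what is left.
# The max of an empty Series is NaN, so group 2 stays in the result.
result = df.groupby('grp')['letters'].agg(lambda s: s.dropna().max())
print(result)
```

Here grp 1 should give 'b' and grp 2 should give NaN, rather than grp 2 disappearing.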