sort / max string with np.nan in pandas

Time:04-09

pandas raises an error when I try to get the max of a string column that contains np.nan, because np.nan is a float and can't be compared to a str.

Any suggestions on how to handle this?

import numpy as np
import pandas as pd

df = pd.DataFrame({'letters': ['a', 'b', np.nan]})
df

# letters
# 0 a
# 1 b
# 2 NaN

for e in df['letters']:
    print(e, type(e))
    
# a <class 'str'>
# b <class 'str'>
# nan <class 'float'>
    
df['letters'].max()

gives the error:

TypeError: '>=' not supported between instances of 'str' and 'float'

----update-----

dropna works for a simple sort/max, but not when there is a groupby, because it removes any group whose values are all NaN. For example:

df = pd.DataFrame({'letters': ['a', 'b', np.nan, np.nan],
                   'grp': [1, 1, 1, 2]})
df
#   letters grp
# 0 a   1
# 1 b   1
# 2 NaN 1
# 3 NaN 2
        
df.groupby('grp')['letters'].max()
# calling dropna first would remove grp == 2 from the result

CodePudding user response:

This happens because the column uses the default missing-value marker, np.nan, which is a float.
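
To make that concrete, here is a minimal pure-Python sketch of the comparison that fails. np.nan is an ordinary Python float, so a plain float NaN stands in for it here; the variable names are illustrative only.

```python
# np.nan is a plain Python float, so float("nan") behaves the same way here.
nan = float("nan")

try:
    "a" >= nan          # Python 3 does not define an ordering between str and float
    raised = False
    message = ""
except TypeError as exc:
    raised = True
    message = str(exc)

print(raised)
print(message)
```

This is the same TypeError that max reports for the object column.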

Taking the max works fine with the newer pd.NA missing-value marker, which you get by calling convert_dtypes so that the column has the string dtype instead of object:

df = df.convert_dtypes()
df['letters'].max()
# 'b'

df['letters'].max(skipna=False)
# <NA>

df after convert_dtypes:

  letters
0       a
1       b
2    <NA>

dtypes:

df.dtypes

letters    string
dtype: object
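
If converting every column with convert_dtypes is more than you want, one possible alternative (an assumption on my part, not something the approach above requires) is to convert only the affected column with astype("string"). A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"letters": ["a", "b", np.nan]})

# Convert just this column to the string dtype; the missing value stays missing.
letters = df["letters"].astype("string")

print(letters.max())         # max skips the missing value by default
print(letters.isna().sum())  # number of missing entries
```

The other columns of df keep their original dtypes, since only the extracted column is converted.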

CodePudding user response:

Perhaps you could drop them first:

out = df['letters'].dropna().max()
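
For the groupby case in the question's update, dropping NaN inside each group rather than before grouping should keep every group in the result. A minimal sketch; max_ignoring_nan is a hypothetical helper name, not a pandas function:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"letters": ["a", "b", np.nan, np.nan],
                   "grp": [1, 1, 1, 2]})

def max_ignoring_nan(values):
    """Return the max of the non-missing values, or NaN if none remain."""
    present = values.dropna()
    return present.max() if len(present) else np.nan

# Each group is reduced on its own, so an all-NaN group still gets a row.
result = df.groupby("grp")["letters"].agg(max_ignoring_nan)
print(result)
```

Group 1 reduces to 'b', and group 2 stays in the output with a missing value instead of disappearing.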

If you need to find the max of multiple columns, then you could stack them; then use groupby max:

out = df.stack().groupby(level=1).max()

Output:

'b'