Home > Software engineering >  Unexpected behaviour when combining string columns with nan values in pandas
Unexpected behaviour when combining string columns with nan values in pandas

Time:11-12

With the following df

import numpy as np
import pandas as pd
data = pd.DataFrame({'A':['Peter','Karl'], 'B':[np.nan,'Jackson']})

I want a column AddCols that combines the column A and B, sometimes they could be numbers, but even if so I want to handle with them as strings so I cast:

data['AddCols'] = data['A'].astype('string')  ','  data['B'].astype('string')

But the result, whenever A or B is NAN returns a NAN:

'|    | AddCols      |
 |---:|:-------------|
 |  0 | <NA>         |
 |  1 | Karl,Jackson |'

Using astype(str) creates a literal string nan, that's why I am using astype('string').

Why is this happening and how to solve it? Ideally for the first row (idx 0) I would want simply:

Peter,

CodePudding user response:

Solution is replace NaN to emty strings, because if use operator with strings NaN is not excluded, so np.nan 'any string' = np.nan:

data['AddCols'] = (data['A'].fillna('').astype('string')  ','  
                   data['B'].fillna('').astype('string'))

Or nae_rep in Series.str.cat:

data['AddCols'] = (data['A'].astype('string')
                            .str.cat(data['B'].fillna('').astype('string'), 
                                     na_rep='', sep=','))

print (data)
       A        B       AddCols
0  Peter      NaN        Peter,
1   Karl  Jackson  Karl,Jackson

CodePudding user response:

From documentation, your code is behaving as expected. if you dont give a presentation of NaN alias na_rep, and there is a NaN, missing values in the Series/Index are omitted from the result. That is, the result of concatenating a string to a NaN is NaN

data['AddCols'] = data['A'].astype('string')  ','  data['B'].fillna('').astype(str)
  • Related