Context
I have created a function that converts categorical data into its unique indices. This works great with all values except NaN.
It seems that the comparison with NaN does not work. This results in the two problems seen below.
Code
     col1
0    male
1  female
2     NaN
3  female
import pandas

def categorial(series: pandas.Series) -> pandas.Series:
    series = series.copy()
    for index, value in enumerate(series.unique()):
        # Problem 1: The output for the value NaN is always 0.0 %, even though NaN is present in the given series.
        print(index, value, round(series[series == value].count() / len(series) * 100, 2), '%')
    for index, value in enumerate(series.unique()):
        # Problem 2: Every unique value is converted to its index except NaN.
        series[series == value] = index
    return series.astype(pandas.Int64Dtype())
Question
- How can I solve the two problems seen in the code above?
CodePudding user response:
You can use fillna with astype and factorize:
df['col1'] = df['col1'].fillna('nan').astype(str).factorize()[0]
Sample:
df = pd.DataFrame({'col1':['a','b',np.nan,'c']})
print (df)
  col1
0    a
1    b
2  NaN
3    c
df['col1'] = df['col1'].fillna('nan').astype(str).factorize()[0]
print (df)
   col1
0     0
1     1
2     2
3     3
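As for why the original comparison fails: NaN never compares equal to anything, including itself, so series == value is False on every row when value is NaN. A minimal sketch of that behaviour, assuming the sample column from the question (the variable names here are only illustrative); isna() and value_counts(dropna=False) are used in place of the equality test:

```python
import numpy as np
import pandas as pd

s = pd.Series(['male', 'female', np.nan, 'female'], name='col1')

# NaN never compares equal to anything, so this mask matches no rows.
nan_matches = int((s == np.nan).sum())

# isna() is the reliable way to locate missing values.
nan_count = int(s.isna().sum())

# value_counts can report the share of every value, NaN included.
shares = s.value_counts(dropna=False, normalize=True) * 100

print(nan_matches, nan_count)
print(shares.round(2))
```

This covers Problem 1 from the question: the percentage for NaN comes from counting with isna() or value_counts(dropna=False) rather than from ==.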
CodePudding user response:
How should the missing values (NaNs) be encoded?
By default, pandas encodes them as -1:
print (pd.factorize(categorial(df['col1']))[0])
[ 0 1 -1 1]
print (df['col1'].astype('category').cat.codes)
0    1
1    0
2   -1
3    0
dtype: int8
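If the goal is to keep the Int64 return type from the question and leave missing values as missing instead of -1, something along these lines might work. This is only a sketch: the name categorial_codes is hypothetical, and treating NaN as <NA> in the result is an assumption about what is wanted.

```python
import numpy as np
import pandas as pd

def categorial_codes(series: pd.Series) -> pd.Series:
    # factorize numbers values in order of first appearance; NaN gets -1.
    codes, _ = pd.factorize(series)
    result = pd.Series(codes, index=series.index, name=series.name).astype('Int64')
    # Assumption: missing values should stay missing rather than become -1.
    result[result == -1] = pd.NA
    return result

s = pd.Series(['male', 'female', np.nan, 'female'], name='col1')
out = categorial_codes(s)
print(out)
```

Unlike cat.codes, which numbers the categories in sorted order, factorize numbers them in order of first appearance, so 'male' gets 0 here.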