I'm not sure "standardize" is the right word for a categorical string variable, but basically I want to create a function that sets every F or f in the column below to 0 and every M or m to 1:
> df['gender']
gender
f
F
f
M
M
m
I tried this:
def padroniza_genero(x):
    if(x == 'f' or x == 'F'):
        replace(['f', 'F'], 0)
    else:
        replace(1)

df1['gender'] = df1['gender'].apply(padroniza_genero)
But I got an error:
NameError: name 'replace' is not defined
Any ideas? Thanks!
CodePudding user response:
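There is no replace function defined in your code. In pandas, replace exists only as a method (for example Series.replace), so calling it as a bare name raises the NameError. Also, a function passed to apply must return the new value for each element.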
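For reference, here is a minimal sketch of how the apply-based function from the question could be repaired. The sample frame is made up to match the column shown in the question, and every value other than f or F is mapped to 1, mirroring the original else branch.

```python
import pandas as pd

# Sample data mirroring the column shown in the question.
df1 = pd.DataFrame({'gender': ['f', 'F', 'f', 'M', 'M', 'm']})

def padroniza_genero(x):
    # apply passes one value at a time; return its replacement.
    if x == 'f' or x == 'F':
        return 0
    else:
        return 1

df1['gender'] = df1['gender'].apply(padroniza_genero)
result = df1['gender'].tolist()
```

This works, but it calls a Python function once per row, which is why a vectorized approach is usually preferable.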
Back to your goal: use a vectorized method rather than apply.
Convert to lowercase and map f->0, m->1:
df['gender_num'] = df['gender'].str.lower().map({'f': 0, 'm': 1})
Or use a comparison (not equal to f) and conversion from boolean to integer:
df['gender_num'] = df['gender'].str.lower().ne('f').astype(int)
output:
  gender  gender_num
0      f           0
1      F           0
2      f           0
3      M           1
4      M           1
5      m           1
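The two one-liners give the same result on clean data, but they may differ when the column contains something other than f or m. A small sketch of that difference, using a hypothetical series with one unexpected value ('x'):

```python
import pandas as pd

# Hypothetical column containing one unexpected value ('x').
s = pd.Series(['f', 'F', 'M', 'm', 'x'], name='gender')
lower = s.str.lower()

# map: values missing from the dictionary become NaN (result is float).
mapped = lower.map({'f': 0, 'm': 1})

# ne: every value that is not 'f' becomes 1, including 'x'.
compared = lower.ne('f').astype(int)
```

So map is probably the safer choice if bad values should stand out as NaN, while the comparison silently treats everything that is not f as 1.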
Generalization
You can generalize to any number of categories using pandas.factorize. Advantage: you get the integer codes together with the unique values, so the mapping back to the original labels is preserved.
NB. The numeric codes are assigned in order of first appearance, or in lexicographic order if sort=True:
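import pandas as pd

# s: integer code per row; key: the unique values (sorted, since sort=True)
s, key = pd.factorize(df['gender'].str.lower(), sort=True)
df['gender_num'] = s
# build a lookup from each code back to its label
key = dict(enumerate(key))
# {0: 'f', 1: 'm'}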
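If the column itself should be stored with a categorical dtype, one possible alternative (a sketch, not part of the answer above) is astype('category') with the .cat accessor. The sample frame is again made up to match the question.

```python
import pandas as pd

# Sample data mirroring the column shown in the question.
df = pd.DataFrame({'gender': ['f', 'F', 'f', 'M', 'M', 'm']})

# Convert to a categorical column; categories are sorted: ['f', 'm'].
cat = df['gender'].str.lower().astype('category')

# cat.codes holds the integer code of each row.
df['gender_num'] = cat.cat.codes

# Lookup from each code back to its label.
key = dict(enumerate(cat.cat.categories))
```

Here key has the same form as the dictionary built from factorize, while cat keeps the labels and codes together in a single column.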