I'm working with pandas in Python. I want to write a function that converts categorical variables to numerical ones: each category (factor) of a variable is replaced by the number of rows in that category where the dependent variable Y equals 1 (possible Y values are 0, 1 and -1), divided by the total number of rows in that category.
Steps: for each category of a categorical variable, count how many times Y=1 and divide that count by the category's total count. Convert all the categorical variables this way and create a new dataframe of the converted variables.
Below is the code to generate sample data:
import pandas as pd
df = pd.DataFrame([['iphone5', 'teams', 'shoe', 1],
['iphone6', 'teams', 'shirt', 0], ['iphone5', 'word', 'shoe', 0],
['iphone7', 'ppt', 'pants', 0], ['iphone8', 'excel', 'umbrella', 1],
['iphone6', 'teams', 'shoe', 1], ['iphone9', 'publisher', 'food', 0]])
df.columns = ['Monday', 'Tuesday', 'Wednesday', 'Y']
df
Checking how many times each category of the Monday variable had each value of Y (including Y=1):
monday_check = pd.DataFrame(df.groupby(['Monday', 'Y'])['Y'].count())
monday_check
Below shows the code to manually convert a categorical variable to numerical as described above.
cond = [df['Monday']=='iphone5',
df['Monday']=='iphone6',
df['Monday']=='iphone7',
df['Monday']=='iphone8',
df['Monday']=='iphone9']
# iphone7 and iphone9 have no rows with Y=1, so their numerator is written as 0
vals = [monday_check.loc[('iphone5', 1)].values.sum() / monday_check.loc['iphone5'].values.sum(),
        monday_check.loc[('iphone6', 1)].values.sum() / monday_check.loc['iphone6'].values.sum(),
        0 / monday_check.loc['iphone7'].values.sum(),
        monday_check.loc[('iphone8', 1)].values.sum() / monday_check.loc['iphone8'].values.sum(),
        0 / monday_check.loc['iphone9'].values.sum()]
import numpy as np
df['Monday_convert'] = np.select(cond, vals)
df
CodePudding user response:
Because the possible Y values are 0, 1 and -1, compare the Y column with 1 and take the mean per Monday group with GroupBy.transform to build the new column:
df['Monday_convert1'] = df['Y'].eq(1).groupby(df['Monday']).transform('mean')
print(df)
    Monday    Tuesday  Wednesday  Y  Monday_convert1  Monday_convert
0  iphone5      teams       shoe  1              0.5             0.5
1  iphone6      teams      shirt  0              0.5             0.5
2  iphone5       word       shoe  0              0.5             0.5
3  iphone7        ppt      pants  0              0.0             0.0
4  iphone8      excel   umbrella  1              1.0             1.0
5  iphone6      teams       shoe  1              0.5             0.5
6  iphone9  publisher       food  0              0.0             0.0
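To see why the comparison with 1 matters, here is a small sketch on made-up data (the `toy` frame and its values are hypothetical, not from the question). The mean of a boolean column is the share of True values, which is exactly "count of Y=1 divided by the category's total count". A plain mean of Y would let a -1 cancel out a 1.

```python
import pandas as pd

# Hypothetical toy data: category 'a' has Y values 1, 0 and -1
toy = pd.DataFrame({'Monday': ['a', 'a', 'a', 'b'],
                    'Y': [1, 0, -1, 1]})

# True only where Y == 1, so -1 is treated the same as 0
is_one = toy['Y'].eq(1)

# Mean of booleans per category = (rows with Y == 1) / (rows in category)
rate = is_one.groupby(toy['Monday']).mean()

# For comparison only: the plain mean of Y mixes in the -1 values
plain_mean = toy.groupby('Monday')['Y'].mean()

# Mapping the per-category rate back onto the rows gives the same
# result as transform('mean')
toy['Monday_convert'] = toy['Monday'].map(rate)
print(rate)
print(plain_mean)
```

For category 'a' the rate is one third (one row out of three has Y=1), while the plain mean of Y is 0.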
CodePudding user response:
Following up on @Jazrael's answer, I looped through the other columns to convert them:
# Assumes Y is the last column and df holds only the original columns
for i in np.arange(0, len(df.columns) - 1):
    col = df.columns[i]
    df[col + '_convert'] = df['Y'].eq(1).groupby(df[col]).transform('mean')
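If you want the converted columns in a new dataframe rather than appended to `df`, as the question asks, one possible way is to wrap the same idea in a function. This is only a sketch: the name `encode_by_positive_rate` and its `target` and `positive` parameters are my own choices, and it assumes every column other than the target is categorical.

```python
import pandas as pd

def encode_by_positive_rate(df, target='Y', positive=1):
    """Return a new DataFrame in which each non-target column is replaced by
    the share of rows in its category where target == positive."""
    hit = df[target].eq(positive)                  # True where Y == 1
    cols = [c for c in df.columns if c != target]  # assumed categorical
    return pd.DataFrame(
        {c + '_convert': hit.groupby(df[c]).transform('mean') for c in cols},
        index=df.index,
    )

# Sample data from the question
df = pd.DataFrame([['iphone5', 'teams', 'shoe', 1],
                   ['iphone6', 'teams', 'shirt', 0], ['iphone5', 'word', 'shoe', 0],
                   ['iphone7', 'ppt', 'pants', 0], ['iphone8', 'excel', 'umbrella', 1],
                   ['iphone6', 'teams', 'shoe', 1], ['iphone9', 'publisher', 'food', 0]],
                  columns=['Monday', 'Tuesday', 'Wednesday', 'Y'])

converted = encode_by_positive_rate(df)
print(converted)
```

The original `df` is left unchanged, so the function can be called again with a different `positive` value if needed.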