Function in Python to make probability conversion of categorical variable to numerical

Time:06-09

I'm working with pandas in Python. I want to write a function that converts categorical variables to numerical ones. For each factor (category) of a categorical variable, the new value should be the number of rows with that factor where the dependent variable Y equals 1 (Y can be 0, 1 or -1), divided by the total number of rows with that factor.

Steps: For each factor or category of a categorical variable, count how many times the dependent variable Y=1, and divide this count by the total count of that factor. Convert all the categorical variables this way and create a new dataframe of the converted categorical variables.

Below is the code to generate sample data:

import pandas as pd

df = pd.DataFrame([['iphone5', 'teams', 'shoe', 1],
                   ['iphone6', 'teams', 'shirt', 0],
                   ['iphone5', 'word', 'shoe', 0],
                   ['iphone7', 'ppt', 'pants', 0],
                   ['iphone8', 'excel', 'umbrella', 1],
                   ['iphone6', 'teams', 'shoe', 1],
                   ['iphone9', 'publisher', 'food', 0]])

df.columns = ['Monday', 'Tuesday', 'Wednesday', 'Y']
df

sample data, df displayed

Checking how many times each factor of the categorical variable Monday occurs with each value of Y:

monday_check = pd.DataFrame(df.groupby(['Monday', 'Y'])['Y'].count())
monday_check

monday_check displayed
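As a side note, the same count table can be sketched with pd.crosstab, which fills missing factor/Y combinations (such as iphone7 with Y=1) with 0 instead of leaving them out. This is only an illustrative alternative, not part of the original question; the names counts and ratio are made up for the example, and it assumes at least one row in the data has Y=1, otherwise counts[1] raises a KeyError.

```python
import pandas as pd

df = pd.DataFrame(
    [['iphone5', 'teams', 'shoe', 1],
     ['iphone6', 'teams', 'shirt', 0],
     ['iphone5', 'word', 'shoe', 0],
     ['iphone7', 'ppt', 'pants', 0],
     ['iphone8', 'excel', 'umbrella', 1],
     ['iphone6', 'teams', 'shoe', 1],
     ['iphone9', 'publisher', 'food', 0]],
    columns=['Monday', 'Tuesday', 'Wednesday', 'Y'])

# Rows are the Monday factors, columns are the Y values that occur in the data;
# combinations that never occur are filled with 0.
counts = pd.crosstab(df['Monday'], df['Y'])

# Share of rows with Y == 1 within each factor.
ratio = counts[1] / counts.sum(axis=1)
print(ratio)
```

Because every factor gets an explicit count for Y=1, none of the zero cases has to be typed in by hand.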

Below shows the code to manually convert a categorical variable to numerical as described above.

cond = [df['Monday']=='iphone5',
    df['Monday']=='iphone6',
    df['Monday']=='iphone7',
    df['Monday']=='iphone8',
    df['Monday']=='iphone9']

vals = [monday_check.loc[('iphone5', 1)].values.sum() / monday_check.loc['iphone5'].values.sum(),
        monday_check.loc[('iphone6', 1)].values.sum() / monday_check.loc['iphone6'].values.sum(),
        0 / monday_check.loc['iphone7'].values.sum(),  # iphone7 never occurs with Y=1
        monday_check.loc[('iphone8', 1)].values.sum() / monday_check.loc['iphone8'].values.sum(),
        0 / monday_check.loc['iphone9'].values.sum()]  # iphone9 never occurs with Y=1

import numpy as np
df['Monday_convert'] = np.select(cond, vals)
df

converted categorical variable to numerical

CodePudding user response:

Because the possible Y values are 0, 1 and -1, compare the Y column with 1 and take the mean of the resulting booleans per Monday group, using GroupBy.transform to build the new column:

df['Monday_convert1'] = df['Y'].eq(1).groupby(df['Monday']).transform('mean')
print (df)
    Monday    Tuesday Wednesday  Y  Monday_convert1  Monday_convert
0  iphone5      teams      shoe  1              0.5             0.5
1  iphone6      teams     shirt  0              0.5             0.5
2  iphone5       word      shoe  0              0.5             0.5
3  iphone7        ppt     pants  0              0.0             0.0
4  iphone8      excel  umbrella  1              1.0             1.0
5  iphone6      teams      shoe  1              0.5             0.5
6  iphone9  publisher      food  0              0.0             0.0
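To see why the comparison with 1 matters, here is a small sketch on made-up data (the toy frame and its factor column are hypothetical, not from the question). Once Y contains -1, a plain group mean of Y is no longer the share of rows with Y=1, while the mean of the boolean Y.eq(1) still is.

```python
import pandas as pd

# Hypothetical toy data: factor 'a' has one Y=1 and one Y=-1.
toy = pd.DataFrame({'factor': ['a', 'a', 'b', 'b'],
                    'Y': [1, -1, 1, 0]})

# Mean of the boolean mask = (count of Y == 1) / (rows in the group).
share_of_ones = toy['Y'].eq(1).groupby(toy['factor']).transform('mean')

# Plain mean of Y: the -1 cancels the 1 in group 'a'.
plain_mean = toy.groupby('factor')['Y'].transform('mean')

print(pd.DataFrame({'share_of_ones': share_of_ones,
                    'plain_mean': plain_mean}))
```

For factor 'a' the share of ones is 0.5, while the plain mean is 0.0.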

CodePudding user response:

Following up with @Jazrael's answer, I looped through the other columns to convert them:

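# Assumes df still holds only the categorical columns plus Y.
# Take the column list before the loop, because the loop adds columns to df.
cat_cols = [c for c in df.columns if c != 'Y']
for col in cat_cols:
    df[col + '_convert'] = df['Y'].eq(1).groupby(df[col]).transform('mean')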
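Since the question asks for a function that returns a new dataframe of the converted variables, the loop could be wrapped roughly as follows. This is a minimal sketch, not a definitive implementation: the name probability_encode and its parameters target and positive are made up for illustration, and it assumes every column other than the target is categorical.

```python
import pandas as pd


def probability_encode(df, target='Y', positive=1):
    """Hypothetical helper: for each non-target column, return the share of
    rows with target == positive within each factor of that column."""
    is_positive = df[target].eq(positive)
    # Assumption: every column except the target is a categorical variable.
    cat_cols = [c for c in df.columns if c != target]
    return pd.DataFrame(
        {col + '_convert': is_positive.groupby(df[col]).transform('mean')
         for col in cat_cols},
        index=df.index)


df = pd.DataFrame(
    [['iphone5', 'teams', 'shoe', 1],
     ['iphone6', 'teams', 'shirt', 0],
     ['iphone5', 'word', 'shoe', 0],
     ['iphone7', 'ppt', 'pants', 0],
     ['iphone8', 'excel', 'umbrella', 1],
     ['iphone6', 'teams', 'shoe', 1],
     ['iphone9', 'publisher', 'food', 0]],
    columns=['Monday', 'Tuesday', 'Wednesday', 'Y'])

converted = probability_encode(df)
print(converted)
```

The original df is left unchanged; converted has one _convert column per categorical variable and the same index as df.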