How do I find the correlation between two categorical variables in Python?

I have these two variables, as shown in the picture. How should I approach finding the correlation between them?

CodePudding user response：

Suppose your DataFrame df has two categorical columns, CAT1 and CAT2. You can check whether they are associated with a chi-square test of independence:

import pandas as pd
from scipy.stats import chi2_contingency

# Observed counts for every pair of categories.
# Leave margins off: the row/column totals must not be fed into the test.
contingency_table = pd.crosstab(df['CAT1'], df['CAT2'])

# Chi-square statistic, p-value, degrees of freedom, expected counts
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

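chi2_contingency returns the chi-square test statistic, the p-value, the degrees of freedom, and the table of expected frequencies. If the p-value is less than 0.05, you can reject the hypothesis that the two variables are independent at the 5% significance level, which suggests there is a relationship between the categories. Note that the test tells you whether an association is likely, not how strong it is. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html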
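As a rough illustration of the test described above, here is a minimal sketch on made-up data. The DataFrame contents are hypothetical and chosen only to show the mechanics. The last step computes Cramér's V, a common effect-size measure derived from the chi-square statistic; that is an addition not mentioned in the answer, included because the test alone does not give the strength of the association.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: 60 rows, two categorical columns
df = pd.DataFrame({
    "CAT1": ["a"] * 30 + ["b"] * 30,
    "CAT2": ["x"] * 25 + ["y"] * 5 + ["x"] * 5 + ["y"] * 25,
})

# Observed counts per (CAT1, CAT2) pair, without margins
table = pd.crosstab(df["CAT1"], df["CAT2"])

# Test of independence (Yates' correction is applied by default for 2x2 tables)
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f}, p={p_value:.3g}, dof={dof}")

# Cramér's V (assumed here as the strength measure), based on the
# uncorrected statistic: V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
chi2_raw, _, _, _ = chi2_contingency(table, correction=False)
n = table.to_numpy().sum()
k = min(table.shape) - 1
cramers_v = np.sqrt(chi2_raw / (n * k))
print(f"Cramér's V = {cramers_v:.3f}")
```

Cramér's V ranges from 0 (no association) to 1 (perfect association), so it can be read loosely like the magnitude of a correlation coefficient, though it carries no sign.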

CodePudding user response：

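You could also start by encoding the data with an ordinal encoder and then computing some form of correlation (for example, a rank correlation) on the encoded values. Keep in mind that this is only meaningful when the categories have a natural order.

Based on the picture, the encoding would look something like this: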

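from sklearn.preprocessing import OrdinalEncoder

ord_enc = OrdinalEncoder()

# fit_transform returns an (n, 1) array, so flatten it before assigning it to a column
df['number_emp_code'] = ord_enc.fit_transform(df[['Number of employees']]).ravel()
df['Method_code'] = ord_enc.fit_transform(df[['Method']]).ravel()

# Display the encoded columns next to the originals
df[['number_emp_code', 'Number of employees', 'Method_code', 'Method']]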

This encodes your data and puts the encoded values into new columns. The last line is just a way to display the new columns next to the originals in isolation.
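One caveat with this approach: by default OrdinalEncoder assigns codes in sorted (alphabetical) order, which may not match the real order of the categories. The sketch below passes an explicit order and then computes a Spearman rank correlation on the codes. The data values and the "rating" column are hypothetical, and Spearman is only one possible choice for the "correlation" step, assumed here because it works on ranks.

```python
import pandas as pd
from scipy.stats import spearmanr
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: both columns are assumed to have a natural order
df = pd.DataFrame({
    "Number of employees": ["1-10", "11-50", "51-200", "1-10", "51-200", "11-50"],
    "rating": ["low", "medium", "high", "medium", "high", "medium"],
})

# Spell out the order explicitly instead of relying on alphabetical sorting
size_order = ["1-10", "11-50", "51-200"]
rating_order = ["low", "medium", "high"]
enc = OrdinalEncoder(categories=[size_order, rating_order])

# One fit for both columns; the result has one column of codes per input column
codes = enc.fit_transform(df[["Number of employees", "rating"]])
df["size_code"] = codes[:, 0]
df["rating_code"] = codes[:, 1]

# Rank correlation between the two encoded columns
rho, p_value = spearmanr(df["size_code"], df["rating_code"])
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3g}")
```

If the categories have no meaningful order, the codes are arbitrary labels and a correlation computed on them should not be interpreted.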