Home > database >  How to find correlation coefficient r in excel sheet using pandas
How to find correlation coefficient r in excel sheet using pandas

Time:10-08

I have an Excel worksheet which is 250 rows by 10 column of data. My dependent variable is n_nnld_trp and I am trying to find which independent variables are highly correlated with it to use in a linear regression model. enter image description here

I want to make a table like this to summarize the correlation data as well as identify any cases of multi-collinearity using the equation in the picture: enter image description here

So far, I managed to use a pivot table to get the mean of each row with my dependent variable being n_hhld_trp:

trip_mean = pd.pivot_table(read_excel, index=['n_hhld_trip'],  
                aggfunc=np.mean)

print(trip_mean.head())

I'm finding it hard to make the table of correlation as shown above and I would welcome and appreciate any help.

CodePudding user response:

Numpy has all necessary functions to calculate any of such common things so to calculate r the dataframe the simplest way is:

import numpy as np
r = np.corrcoef(df.values)

Alternatively, to calculate between individual pairs of variables you can just feed a smaller array to corrcoef function or just calculate it directly:

r = np.cov(df.n_nnld_trp.values, df.other_col.values) / (np.std(df.n_nnld.trp.values) * np.std(df.other_col.values))

CodePudding user response:

After a few hours of digging around I got it to how I wanted to present it. For anyone wanting to do something similar, see code below:

import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pearson_correlation = read_excel.corr(method='pearson')
print(pearson_correlation)

enter image description here

  • Related