How to calculate correlation and covariance of X and Y in Python

Time:04-01

Given the table below:

X   Y   pr
0   1   0.30
0   2   0.25
1   1   0.15
1   2   0.30

I want to create a custom function that calculates the covariance and correlation between X and Y.

I need to find the means of both X and Y, subtract the corresponding mean from each value, and then multiply the resulting deviations together.

Here is my attempt, which is not very good:

def cor(data_frame):
    data_frame[['X']].mean()
    data_frame[['Y']].mean()

    cov = pd.merge(distr_table.groupby('X', as_index=False)['pr'].mean(), distr_table.groupby('Y', as_index=False)['pr'].mean(), how='cross')

I need to find a way to iterate over the rows. Thanks.

CodePudding user response:

You don't need a loop. You can apply the definition of covariance directly, using pr as the weights:

((data_frame['X']-(data_frame['X']*data_frame['pr']).sum())*
 (data_frame['Y']-(data_frame['Y']*data_frame['pr']).sum())*
 data_frame['pr']).sum()
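
As a rough sketch of how that expression plays out on the table from the question: the function name weighted_cov is only an illustrative choice, and the code assumes the pr column sums to 1 (it does here).

```python
import pandas as pd

# The joint distribution from the question; pr is assumed to sum to 1.
distr_table = pd.DataFrame({
    "X": [0, 0, 1, 1],
    "Y": [1, 2, 1, 2],
    "pr": [0.30, 0.25, 0.15, 0.30],
})

def weighted_cov(data_frame):
    """Cov(X, Y) = sum over rows of (x - E[X]) * (y - E[Y]) * pr."""
    mean_x = (data_frame["X"] * data_frame["pr"]).sum()  # E[X]
    mean_y = (data_frame["Y"] * data_frame["pr"]).sum()  # E[Y]
    deviations = (data_frame["X"] - mean_x) * (data_frame["Y"] - mean_y)
    return (deviations * data_frame["pr"]).sum()

cov_xy = weighted_cov(distr_table)
print(cov_xy)  # approximately 0.0525
```

Working it by hand gives the same figure: E[X] = 0.45, E[Y] = 1.55, E[XY] = 0.75, so the covariance is 0.75 - 0.45 * 1.55 = 0.0525.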

However, I would strongly recommend not reimplementing it, as the function already exists in the NumPy library:

import numpy as np

covariance_matrix = np.cov(data_frame['X'], data_frame['Y'], ddof=0, aweights=data_frame['pr'])

This returns the whole covariance matrix; you can access the covariance of X and Y using:

covariance_matrix[0,1]
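
The question also asks for the correlation. np.corrcoef does not accept weights, so one possible approach is to derive the correlation from the weighted covariance matrix itself. This is only a sketch: the DataFrame is rebuilt from the question's table, and the variable names are my own.

```python
import numpy as np
import pandas as pd

# Rebuild the table from the question so the example runs on its own.
data_frame = pd.DataFrame({
    "X": [0, 0, 1, 1],
    "Y": [1, 2, 1, 2],
    "pr": [0.30, 0.25, 0.15, 0.30],
})

# Weighted population covariance matrix (ddof=0, pr used as weights).
cov_matrix = np.cov(
    data_frame["X"].to_numpy(),
    data_frame["Y"].to_numpy(),
    ddof=0,
    aweights=data_frame["pr"].to_numpy(),
)

cov_xy = cov_matrix[0, 1]   # Cov(X, Y)
var_x = cov_matrix[0, 0]    # Var(X) sits on the diagonal
var_y = cov_matrix[1, 1]    # Var(Y) sits on the diagonal

# Pearson correlation: covariance divided by the product of standard deviations.
corr_xy = cov_xy / np.sqrt(var_x * var_y)
print(cov_xy, corr_xy)  # roughly 0.0525 and 0.212
```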

CodePudding user response:

Please go easy on me if the code is lengthy and boring; I am a newbie. Please do comment, as that will help me learn more.

Here is a custom function that returns all the individual values behind the covariance and the correlation coefficient. It works on a DataFrame with only X and Y columns, so the pr column is ignored and every row is weighted equally.

import numpy as np

def cov_corr(df):
    n = len(df)
    # Unweighted means: every row counts equally
    mean_x = df['X'].sum() / n
    mean_y = df['Y'].sum() / n

    # Deviation of each value from its mean
    sub_x = [x - mean_x for x in df['X']]
    sub_y = [y - mean_y for y in df['Y']]

    # Population variances, i.e. cov(X, X) and cov(Y, Y)
    cov_x = sum(d**2 for d in sub_x) / n
    cov_y = sum(d**2 for d in sub_y) / n

    # Population covariance; zip avoids relying on a 0..n-1 index
    cov_xy = sum(dx * dy for dx, dy in zip(sub_x, sub_y)) / n

    # Pearson correlation coefficient, rounded to 2 decimals
    corr_xy = round(cov_xy / (np.sqrt(cov_x) * np.sqrt(cov_y)), 2)

    keys = ['mean_x', 'mean_y', 'sub_x', 'sub_y', 'cov_x', 'cov_y', 'cov_xy', 'corr_xy']
    all_vals = dict(zip(keys, [mean_x, mean_y, sub_x, sub_y, cov_x, cov_y, cov_xy, corr_xy]))
    return all_vals
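
For comparison, the same unweighted quantities can be cross-checked with NumPy's built-ins. This is just a sketch under the assumption that each row of the question's table counts once (the pr column is dropped), which is why the result differs from the weighted answer above.

```python
import numpy as np
import pandas as pd

# Only the X and Y columns of the question's table; each row counts once.
df = pd.DataFrame({"X": [0, 0, 1, 1], "Y": [1, 2, 1, 2]})

x = df["X"].to_numpy()
y = df["Y"].to_numpy()

# Population covariance (ddof=0 divides by n, matching a manual /len(df)).
unweighted_cov = np.cov(x, y, ddof=0)[0, 1]

# Pearson correlation coefficient from the correlation matrix.
unweighted_corr = np.corrcoef(x, y)[0, 1]

print(unweighted_cov, unweighted_corr)
```

With equal weights the four (X, Y) combinations are perfectly balanced, so both values come out as zero, whereas weighting by pr gives a covariance of about 0.0525.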