I am working on a method for calculating the correlation between two columns of data from a dataset. The dataset consists of 4 columns: A1, A2, A3, and Class. My goal is to remove A3 if the correlation between A1 and A3 is greater than 0.6 or less than 0.6.
A sample of the data set is given below:
A1,A2,A3,Class
2,0.4631338,1.5,3
8,0.7460648,3.0,3
6,0.264391038,2.5,2
5,0.4406713,2.3,1
2,0.410438159,1.5,3
2,0.302901816,1.5,2
6,0.275869396,2.5,3
8,0.084782428,3.0,3
The Python program that I am using for this project is written like so:
from numpy.core.defchararray import count
import pandas as pd
import numpy as np

def main():
    s = pd.read_csv('A1-dm.csv')
    print(calculate_correlation(s))

def calculate_correlation(s):
    # if correlation > 0.6 or correlation < 0.6 remove A3
    s = s[['A1','A3']]
    return s.corr()[1,0]

main()
When I run my code I get the following error:
File "C:\Users\physe\AppData\Roaming\Python\Python36\site-packages\pandas\core\indexes\base.py", line 2897, in get_loc
raise KeyError(key) from err
KeyError: (1, 0)
I've reviewed the documentation here. The issue that I'm facing is selecting the 1,0 element from the correlation matrix that is returned by .corr(). Any help with this would be greatly appreciated.
CodePudding user response:
Here is my example:
cor = df.corr()
# cor['A3'] is a whole column, so pick the single A1/A3 value to compare
if cor.loc['A1', 'A3'] > 0.6:
    df.drop(columns='A3', inplace=True)
CodePudding user response:
Try:
corr = df.corr()
if corr['A3'].loc['A1'] != 0.6:
    df.drop(columns=['A3'], inplace=True)
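If the intent is to drop A3 whenever A1 and A3 are strongly correlated in either direction, one possible reading is to compare the absolute value of the correlation against 0.6. The following is a minimal sketch under that assumption; the helper name drop_if_correlated and its parameters are hypothetical, and the rows are the sample from the question.

```python
import io
import pandas as pd

# Sample rows copied from the question.
CSV = """A1,A2,A3,Class
2,0.4631338,1.5,3
8,0.7460648,3.0,3
6,0.264391038,2.5,2
5,0.4406713,2.3,1
2,0.410438159,1.5,3
2,0.302901816,1.5,2
6,0.275869396,2.5,3
8,0.084782428,3.0,3
"""


def drop_if_correlated(df, base="A1", candidate="A3", threshold=0.6):
    """Return df without `candidate` when |corr(base, candidate)| > threshold.

    Assumption: "strongly correlated" means the absolute Pearson
    correlation exceeds the threshold.
    """
    r = df[base].corr(df[candidate])  # Series.corr returns a single float
    if abs(r) > threshold:
        return df.drop(columns=[candidate])
    return df


df = pd.read_csv(io.StringIO(CSV))
result = drop_if_correlated(df)
print(list(result.columns))
```

Returning a new DataFrame instead of using inplace=True keeps the original data available for comparison.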
CodePudding user response:
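Use .iloc to get the 1,0 element from the correlation matrix. s.corr()[1,0] fails because indexing a DataFrame with [] looks up a column label, and there is no column named (1, 0); .iloc selects by integer position instead.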
Here:
def calculate_correlation(s):
    # if correlation > 0.6 or correlation < 0.6 remove A3
    s = s[['A1','A3']]
    return s.corr().iloc[1,0]
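To illustrate the difference between positional and label-based access, here is a small sketch that reads the sample rows from the question out of an in-memory string (an assumption made so the example runs without the A1-dm.csv file) and fetches the same value three ways.

```python
import io
import pandas as pd

# Sample rows copied from the question.
CSV = """A1,A2,A3,Class
2,0.4631338,1.5,3
8,0.7460648,3.0,3
6,0.264391038,2.5,2
5,0.4406713,2.3,1
2,0.410438159,1.5,3
2,0.302901816,1.5,2
6,0.275869396,2.5,3
8,0.084782428,3.0,3
"""

df = pd.read_csv(io.StringIO(CSV))
corr = df[["A1", "A3"]].corr()  # 2x2 matrix labelled by column name

by_position = corr.iloc[1, 0]     # row 1, column 0 by integer position
by_label = corr.loc["A3", "A1"]   # the same cell, addressed by label
direct = df["A1"].corr(df["A3"])  # skips the matrix entirely

print(by_position, by_label, direct)
```

Label-based access with .loc does not depend on the column order, which may be preferable if the selection of columns changes later.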