Why does the correlation matrix have fewer columns than the pandas DataFrame?

Time:11-14

When I use pandas.DataFrame.corr() to create a correlation matrix, the resulting matrix (corr_matrix) has 37 columns, while the DataFrame (all_data) has 80 columns. I expected the two column counts to match. In other words, the correlation matrix should have the shape (80, 80), but it does not. I imputed all missing data before creating the correlation matrix. So why are the two column counts not equal?

The code

corr_matrix = all_data.corr(method="kendall").abs()
print("Missing value descending:\n{}\n".format(all_data.isnull().sum().sort_values(ascending=False)[:5]))
print("Original Dataframe shape: {}".format(all_data.shape))
print("Correlation Matrix shape: {}".format(corr_matrix.shape))

The output

Missing value descending:
MSSubClass     0
MSZoning       0
GarageYrBlt    0
GarageType     0
FireplaceQu    0
dtype: int64

Original Dataframe shape: (2904, 80)
Correlation Matrix shape: (37, 37)

CodePudding user response:

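Does your DataFrame contain categorical columns?

corr() only computes correlations between numerical columns; categorical (non-numeric) columns are left out of the result. That would explain the 37 columns: presumably only 37 of the 80 columns in all_data are numeric. At least, that is what the following example shows. (Note that in pandas 2.0 and later, numeric_only defaults to False, so you have to pass numeric_only=True to corr() to get this behavior; otherwise it raises an error on non-numeric columns.)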

import pandas as pd

train = pd.DataFrame({
    "cat1": list("ABC"),
    "cat2": list("xyz"),
    "num1": [1, 2, 3],
    "num2": [-2, 10, -5]
})

# 2 numerical and 2 categorical columns
>>> train 

  cat1 cat2  num1  num2
0    A    x     1    -2
1    B    y     2    10
2    C    z     3    -5

# only numerical columns are present 
>>> train.corr(method="kendall").abs()

          num1      num2
num1  1.000000  0.333333
num2  0.333333  1.000000
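
To check this on your own data, you can count the columns by dtype before calling corr(). The sketch below reuses the toy train frame from above; the names numeric_cols and other_cols are only illustrative. It uses the default Pearson method because the choice of method does not change which columns end up in the matrix, and method="kendall" relies on SciPy being installed.

```python
import pandas as pd

# Same toy frame as above: 2 categorical and 2 numerical columns.
train = pd.DataFrame({
    "cat1": list("ABC"),
    "cat2": list("xyz"),
    "num1": [1, 2, 3],
    "num2": [-2, 10, -5],
})

# Columns with a numeric dtype: these are the ones corr() can use.
numeric_cols = train.select_dtypes(include="number").columns.tolist()
# Everything else (object, category, ...) does not appear in the matrix.
other_cols = train.select_dtypes(exclude="number").columns.tolist()

print(train.dtypes.value_counts())
print("numeric:", numeric_cols)
print("non-numeric:", other_cols)

# Selecting the numeric columns explicitly makes the result independent
# of the pandas version's numeric_only default.
corr_matrix = train[numeric_cols].corr().abs()
print("Correlation Matrix shape: {}".format(corr_matrix.shape))
```

Running the same select_dtypes calls on all_data should show how the 80 columns split into numeric and non-numeric ones, and the numeric count should match the width of the correlation matrix.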