I have a dataframe with the following columns. When I compute the correlation matrix, I only see the columns that have an int data type. I am new to ML. Can someone tell me what mistake I am making here?
CodePudding user response:
From the docs, numeric_only is set to True by default in the corr function, so only numeric columns are used. That is why the columns in your final result are exactly the ones with numeric dtypes. Setting numeric_only=False makes corr consider every column, but that only works if every column can be converted to a number.
This default is deprecated, though: in pandas 2.0 and later, numeric_only defaults to False.
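To make the default behaviour concrete, here is a minimal sketch. The dataframe and its column names (age, income, city) are made up for illustration, and it assumes pandas 1.5 or later, where corr accepts the numeric_only argument. Passing the argument explicitly gives the same result regardless of which default your pandas version uses.

```python
import pandas as pd

# Hypothetical data: two numeric columns and one string column.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40000.0, 52000.0, 81000.0, 90000.0],
    "city": ["Paris", "Lyon", "Paris", "Nice"],
})

# Ask explicitly for numeric columns only, so the result does not
# depend on the default of the installed pandas version.
corr = df.corr(numeric_only=True)

# The string column "city" is left out of the correlation matrix.
print(corr.columns.tolist())  # ['age', 'income']
```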
CodePudding user response:
As you correctly observe, and as @Kraigolas quotes from the docs:
numeric_only : bool, default True. Include only float, int or boolean data.
This means that by default corr only uses the numeric columns. You can change this with:
df.corr(numeric_only=False)
However, pandas will then try to convert the values to float to compute the correlation, and if a column contains non-numeric values it will fail with:
ValueError: could not convert string to float: 'X'
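One way to sidestep that error is to select the numeric columns yourself before calling corr. The sketch below uses made-up data (columns x, y, label) and DataFrame.select_dtypes. It catches TypeError as well as ValueError only as a precaution, since the exact exception may vary between pandas versions.

```python
import pandas as pd

# Hypothetical data with one string column that cannot be cast to float.
df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2.0, 4.1, 5.9, 8.2],
    "label": ["X", "Y", "X", "Z"],
})

try:
    # Including the string column forces a conversion to float, which fails.
    df.corr(numeric_only=False)
except (ValueError, TypeError) as exc:
    print(f"corr failed: {exc}")

# Keep only the numeric columns, then compute the correlation on those.
numeric_df = df.select_dtypes(include="number")
corr = numeric_df.corr()
print(corr.columns.tolist())  # ['x', 'y']
```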
CodePudding user response:
Convert numbers that are stored as a non-numeric type (for example, as strings) to numeric values using pd.to_numeric.
df = df.apply(pd.to_numeric)
Also convert categorical data, such as city names, to dummy variables that can be used to compute the correlation, as is done in this thread. Essentially, every column you want to include in the correlation needs to be a float or an integer, preferably all one or the other; otherwise you are likely to run into problems.
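Both steps can be sketched together as follows. The data and column names (price, rooms, city) are invented for illustration, and pd.get_dummies is used here as one common way to build dummy variables; the linked thread may do it differently.

```python
import pandas as pd

# Hypothetical data: "price" holds numbers stored as strings,
# "city" is a categorical column.
df = pd.DataFrame({
    "price": ["10.5", "12.0", "9.75", "14.2"],
    "rooms": [2, 3, 2, 4],
    "city": ["Paris", "Lyon", "Paris", "Nice"],
})

# Turn the numeric strings into real floats.
df["price"] = pd.to_numeric(df["price"])

# Replace "city" with one 0/1 column per city, stored as float.
encoded = pd.get_dummies(df, columns=["city"], dtype=float)

# Every column is numeric now, so corr can use all of them.
corr = encoded.corr()
print(corr.columns.tolist())
```

Note that a correlation with a dummy column only tells you how a value relates to membership of that one category, so read those entries with care.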