Pandas - Correlation .corr() for int64 datatypes in a DataFrame returns the same correlation. Bug?

Time:06-20

I'm trying to find the correlations between a target column with datatype float, and other columns with mixed datatypes.

df.dtypes returns:

Pos           object
Age          float64
Year           int64
Pts Won      float64
Pts Max      float64
Share        float64
Team          object
Team Rank    float64
W            float64
L            float64
W/L%         float64
GB            object
PS/G         float64
PA/G         float64
SRS          float64
G TOT        float64
GS TOT       float64
MP TOT       float64
FG TOT       float64
FGA TOT      float64
FG%          float64
3P TOT       float64
3PA TOT      float64
3P%          float64
2P TOT       float64
2PA TOT      float64
2P%          float64
eFG%         float64


dtype: object

When I run the pandas correlation command to find correlations with the column Share, everything looks normal, with a distinct correlation value for each column:

Age          0.018080
Year        -0.008203
Pts Won      0.995639
Pts Max      0.523850
Share        1.000000
Team Rank   -0.124671
W            0.119965
L           -0.119570
W/L%         0.124102
PS/G         0.041559
PA/G        -0.039062
SRS          0.118732
G TOT        0.089035
GS TOT       0.166717
MP TOT       0.167609
FG TOT       0.285257
FGA TOT      0.258544
FG%          0.063012
3P TOT       0.118244
3PA TOT      0.120624
3P%          0.009359
2P TOT       0.289153
2PA TOT      0.265193
2P%          0.058526
eFG%         0.055817

However, when I convert selected columns to type int64 and rerun the correlation, I receive repeated correlation values for those int64 columns:

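import numpy as np

# As written, 'L' 'GB' has no comma between the two strings, so Python
# concatenates them into 'LGB' and the list contains neither 'L' nor 'GB'.
convert_col = ['Age', 'Team Rank', 'W', 'L' 'GB']
for col in df_final:
    if ('TOT' in col) or (col in convert_col):
        df_final[col] = df_final[col].values.astype(np.int64)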

df_final.corr()['Share']

returns:

Age          0.001156
Year        -0.008203
Pts Won      0.995639
Pts Max      0.523850
Share        1.000000
Team Rank    0.004556
W            0.004556
L           -0.119570
W/L%         0.124102
PS/G         0.041559
PA/G        -0.039062
SRS          0.118732
G TOT        0.001156
GS TOT       0.001156
MP TOT       0.001156
FG TOT       0.001156
FGA TOT      0.001156
FG%          0.063012
3P TOT       0.001156
3PA TOT      0.001156
3P%          0.009359
2P TOT       0.001156
2PA TOT      0.001156
2P%          0.058526
eFG%         0.055817

As shown, the columns with type int64 all have a correlation of either 0.001156 or 0.004556, which is clearly wrong, both in theory and compared with the results obtained with type float64.

Could anybody explain why this is the case and/or if there is a correction? I converted the datatypes to int64 for user-friendliness / readability purposes.

Samples of data before and after …

Before:

       Player          Pos  Age   Year  ...  Share  Team Rank  W     L     W/L%   ...
13902  Thaddeus Young  PF   33.0  2022  ...  0.001  21.0       34.0  48.0  0.415  ...
13903  Trae Young      PG   23.0  2022  ...  0.05   15.0       43.0  39.0  0.524  ...
13904  Omer Yurtseven  C    23.0  2022  ...  0.0    2.0        53.0  29.0  0.646  ...

After:

       Player          Pos  Age  Year  ...  Share  Team Rank  W   L   W/L%   ...
13902  Thaddeus Young  PF   33   2022  ...  0.001  21         34  48  0.415  ...
13903  Trae Young      PG   23   2022  ...  0.05   15         43  39  0.524  ...
13904  Omer Yurtseven  C    23   2022  ...  0.0    2          53  29  0.646  ...

CodePudding user response:

It may be that the columns converted to int64 contain np.nan, which has no integer representation, so the conversion silently produces garbage. In the example below, nan is cast to a very large negative number (the minimum int64 value):

np.array([1,2,3,np.nan]).astype("int64")

array([1, 2, 3,-9223372036854775808], dtype=int64)
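
To see why such a sentinel value would make unrelated columns report the same correlation, here is a minimal sketch on made-up data (the column names a, b and target and all of the values are assumptions for illustration). The missing entries are replaced with the int64 minimum explicitly, mirroring the output above, rather than relying on the cast itself.

```python
import numpy as np
import pandas as pd

INT64_MIN = np.iinfo(np.int64).min  # -9223372036854775808, the value NaN became above

# Hypothetical data: two unrelated columns that are both missing in the last row.
df_float = pd.DataFrame({
    "a":      [1.0, 2.0, 3.0, 4.0, 5.0, np.nan],
    "b":      [10.0, 50.0, 20.0, 40.0, 30.0, np.nan],
    "target": [0.1, 0.5, 0.2, 0.9, 0.3, 0.8],
})

# Emulate the lossy cast: every NaN turns into the int64 minimum.
df_int = df_float.copy()
for col in ["a", "b"]:
    df_int[col] = df_int[col].fillna(float(INT64_MIN)).astype(np.int64)

corr_float = df_float.corr()["target"]  # rows with NaN are skipped pairwise
corr_int = df_int.corr()["target"]      # the sentinel row dominates everything else

print(corr_float[["a", "b"]])  # two clearly different values
print(corr_int[["a", "b"]])    # two (nearly) identical values
```

Next to a value of that magnitude, the real data in each column is numerically negligible, so every column that is missing in the same rows ends up with essentially the same correlation against the target.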

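To avoid this, you can convert with pandas' nullable integer type, pd.Int64Dtype, which stores missing values as pd.NA instead of forcing them into an integer. Call astype on the Series itself rather than on .values, because a NumPy array does not accept this dtype:

df_final[col] = df_final[col].astype(pd.Int64Dtype())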

Maybe this article will be useful for you:

https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
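
A short sketch of the nullable-integer route, assuming a recent pandas version and a made-up two-column frame (the values and the missing entry are illustrative, not taken from the real dataset). It uses the string alias "Int64", which refers to the same nullable dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical data: integer-valued floats with one missing entry.
df = pd.DataFrame({
    "W":     [34.0, 43.0, np.nan, 53.0, 20.0],
    "Share": [0.001, 0.05, 0.2, 0.0, 0.3],
})

corr_before = df.corr().loc["W", "Share"]

# Series.astype (not ndarray.astype) understands the extension dtype "Int64".
df["W"] = df["W"].astype("Int64")

corr_after = df.corr().loc["W", "Share"]

print(df["W"])                  # integers, with <NA> where the value was missing
print(corr_before, corr_after)  # the missing row is still skipped, so these agree
```

The column now displays as integers, and because the missing entry stays missing, the correlation is unaffected by the change of type.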

CodePudding user response:

Changing the float-to-int conversion to df_final = df_final.astype({col:'int'}) seems to have fixed the issue. I'm now getting distinct correlation values for the int columns.
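
A small sketch of that form on made-up data (the frame and its values are assumptions). One relevant difference from calling .values.astype(np.int64) is that pandas' own astype raises an error for a float column that still contains NaN, so a bad cast cannot go unnoticed.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "Age" has a missing value, "W" does not.
df = pd.DataFrame({
    "Age": [33.0, 23.0, np.nan],
    "W":   [34.0, 43.0, 53.0],
})

# Columns without NaN convert cleanly through pandas' own astype.
df = df.astype({"W": "int"})

# A column that still contains NaN is rejected instead of being silently corrupted.
try:
    df.astype({"Age": "int"})
    age_cast_failed = False
except ValueError as exc:  # IntCastingNaNError in recent pandas, a ValueError subclass
    age_cast_failed = True
    print(f"Age not converted: {exc}")

print(df.dtypes)
```

If the error appears, the missing values have to be dropped or filled first, or the column kept as a nullable integer type.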
