Pandas - .corr() returns the same correlation for int64 columns in a DataFrame. Bug?

I'm trying to find the correlations between a target column with datatype float, and other columns with mixed datatypes.

df.dtypes returns:

Pos           object
Age          float64
Year           int64
Pts Won      float64
Pts Max      float64
Share        float64
Team          object
Team Rank    float64
W            float64
L            float64
W/L%         float64
GB            object
PS/G         float64
PA/G         float64
SRS          float64
G TOT        float64
GS TOT       float64
MP TOT       float64
FG TOT       float64
FGA TOT      float64
FG%          float64
3P TOT       float64
3PA TOT      float64
3P%          float64
2P TOT       float64
2PA TOT      float64
2P%          float64
eFG%         float64


dtype: object

Here, when I run the pandas correlation command to find correlations with column Share, everything returns normal with unique correlation values:

Age          0.018080
Year        -0.008203
Pts Won      0.995639
Pts Max      0.523850
Share        1.000000
Team Rank   -0.124671
W            0.119965
L           -0.119570
W/L%         0.124102
PS/G         0.041559
PA/G        -0.039062
SRS          0.118732
G TOT        0.089035
GS TOT       0.166717
MP TOT       0.167609
FG TOT       0.285257
FGA TOT      0.258544
FG%          0.063012
3P TOT       0.118244
3PA TOT      0.120624
3P%          0.009359
2P TOT       0.289153
2PA TOT      0.265193
2P%          0.058526
eFG%         0.055817

However, when I convert selected columns to type int64 and rerun the correlation, I receive repeated correlation values for those int64 columns:

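import numpy as np

# Note: 'L' 'GB' (no comma) is concatenated to 'LGB', so neither 'L' nor 'GB' is matched below.
convert_col = ['Age', 'Team Rank', 'W', 'L' 'GB']
for col in df_final:
    if ('TOT' in col) or (col in convert_col):
        df_final[col] = df_final[col].values.astype(np.int64)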

df_final.corr()['Share']

returns:

Age          0.001156
Year        -0.008203
Pts Won      0.995639
Pts Max      0.523850
Share        1.000000
Team Rank    0.004556
W            0.004556
L           -0.119570
W/L%         0.124102
PS/G         0.041559
PA/G        -0.039062
SRS          0.118732
G TOT        0.001156
GS TOT       0.001156
MP TOT       0.001156
FG TOT       0.001156
FGA TOT      0.001156
FG%          0.063012
3P TOT       0.001156
3PA TOT      0.001156
3P%          0.009359
2P TOT       0.001156
2PA TOT      0.001156
2P%          0.058526
eFG%         0.055817

As shown, the columns with type int64 all have a correlation of 0.001156 or 0.004556, which is clearly wrong, both in theory and when back-testing with type float64.

Could anybody explain why this is the case and/or if there is a correction? I converted the datatypes to int64 for user-friendliness / readability purposes.

Samples of data before and after …

Before:

	Player	Pos	Age	Year	...	Share	Team Rank	W	L	W/L%	...
13902	Thaddeus Young	PF	33.0	2022	...	0.001	21.0	34.0	48.0	0.415	...
13903	Trae Young	PG	23.0	2022	...	0.05	15.0	43.0	39.0	0.524	...
13904	Omer Yurtseven	C	23.0	2022	...	0.0	2.0	53.0	29.0	0.646	...

After:

	Player	Pos	Age	Year	...	Share	Team Rank	W	L	W/L%	...
13902	Thaddeus Young	PF	33	2022	...	0.001	21	34	48	0.415	...
13903	Trae Young	PG	23	2022	...	0.05	15	43	39	0.524	...
13904	Omer Yurtseven	C	23	2022	...	0.0	2	53	29	0.646	...

CodePudding user response：

It may be that the columns converted to int64 contain np.nan, which a plain integer cast cannot represent. For example, in the conversion result below, NaN becomes a very large negative integer (here, the minimum int64 value).

np.array([1,2,3,np.nan]).astype("int64")

array([1, 2, 3,-9223372036854775808], dtype=int64)

If you want to avoid this, you can keep the missing values as pd.NA by converting the Series itself (not its .values array) to the nullable pd.Int64Dtype.

df_final[col] = df_final[col].astype(pd.Int64Dtype())

Maybe this article will be useful for you:

https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
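
As a rough illustration of that suggestion, here is a minimal sketch on a small made-up frame (the values are hypothetical, not the asker's data). It counts missing values before converting, then uses the nullable "Int64" dtype and compares the correlation with Share before and after the conversion.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 'W' contains one missing value.
df = pd.DataFrame({
    "Share": [0.001, 0.05, 0.0, 0.2],
    "W": [34.0, 43.0, np.nan, 53.0],
})

# A non-zero count means a plain int64 cast cannot represent the column.
nan_count = int(df["W"].isna().sum())

corr_float = df.corr()["Share"]["W"]

# The nullable extension dtype keeps the missing entry as <NA>.
df["W"] = df["W"].astype("Int64")

corr_nullable = df.corr()["Share"]["W"]

print(nan_count, df["W"].dtype)
print(corr_float, corr_nullable)
```

Because .corr() skips missing pairs, the two correlations should agree when the missing entry is preserved instead of being replaced by a sentinel integer.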

CodePudding user response：

Changing the float-to-int conversion to df_final = df_final.astype({col:'int'}) seems to have fixed the issue. Now I'm getting unique correlation values for the int columns.
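
For reference, a sketch of that conversion applied to all target columns at once, on a hypothetical toy frame with no missing values (with NaN present, astype to a plain integer type raises an error rather than silently producing a sentinel value). The names before, after, and to_int are only for illustration.

```python
import pandas as pd

# Hypothetical toy data: whole-number floats, no missing values.
df_final = pd.DataFrame({
    "Share": [0.001, 0.05, 0.0, 0.2],
    "Age": [33.0, 23.0, 23.0, 27.0],
    "G TOT": [52.0, 76.0, 56.0, 68.0],
})

before = df_final.corr()["Share"]

# Build one dtype mapping and convert in a single astype call.
convert_col = ["Age"]
to_int = {col: "int64" for col in df_final.columns
          if "TOT" in col or col in convert_col}
df_final = df_final.astype(to_int)

after = df_final.corr()["Share"]

print(df_final.dtypes)
print((before - after).abs().max())
```

Comparing before and after like this is a quick way to confirm that the dtype change alone did not alter the correlations.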