I'm trying to find the correlations between a target column with datatype float, and other columns with mixed datatypes.
df.dtypes
returns:
Pos object
Age float64
Year int64
Pts Won float64
Pts Max float64
Share float64
Team object
Team Rank float64
W float64
L float64
W/L% float64
GB object
PS/G float64
PA/G float64
SRS float64
G TOT float64
GS TOT float64
MP TOT float64
FG TOT float64
FGA TOT float64
FG% float64
3P TOT float64
3PA TOT float64
3P% float64
2P TOT float64
2PA TOT float64
2P% float64
eFG% float64
dtype: object
Here, when I run the pandas correlation command to find correlations with column Share, everything returns normally with unique correlation values:
Age 0.018080
Year -0.008203
Pts Won 0.995639
Pts Max 0.523850
Share 1.000000
Team Rank -0.124671
W 0.119965
L -0.119570
W/L% 0.124102
PS/G 0.041559
PA/G -0.039062
SRS 0.118732
G TOT 0.089035
GS TOT 0.166717
MP TOT 0.167609
FG TOT 0.285257
FGA TOT 0.258544
FG% 0.063012
3P TOT 0.118244
3PA TOT 0.120624
3P% 0.009359
2P TOT 0.289153
2PA TOT 0.265193
2P% 0.058526
eFG% 0.055817
However, when I convert select columns to type int64 and rerun the correlation, I receive repeating correlation values for said int64 type columns:
convert_col = ['Age', 'Team Rank', 'W', 'L' 'GB']
for col in df_final:
    if ('TOT' in col) or (col in convert_col):
        df_final[col] = df_final[col].values.astype(np.int64)
df_final.corr()['Share']
returns:
Age 0.001156
Year -0.008203
Pts Won 0.995639
Pts Max 0.523850
Share 1.000000
Team Rank 0.004556
W 0.004556
L -0.119570
W/L% 0.124102
PS/G 0.041559
PA/G -0.039062
SRS 0.118732
G TOT 0.001156
GS TOT 0.001156
MP TOT 0.001156
FG TOT 0.001156
FGA TOT 0.001156
FG% 0.063012
3P TOT 0.001156
3PA TOT 0.001156
3P% 0.009359
2P TOT 0.001156
2PA TOT 0.001156
2P% 0.058526
eFG% 0.055817
As shown, the columns with type int64 all have a correlation of 0.001156 or 0.004556, when this is clearly not the case, both in theory and in back-testing with type float64.
Could anybody explain why this is the case and/or if there is a correction? I converted the datatypes to int64 for user-friendliness / readability purposes.
Samples of data before and after …
Before:
| | Player | Pos | Age | Year | ... | Share | Team Rank | W | L | W/L% | ... |
---|---|---|---|---|---|---|---|---|---|---|---|
13902 | Thaddeus Young | PF | 33.0 | 2022 | ... | 0.001 | 21.0 | 34.0 | 48.0 | 0.415 | ... |
13903 | Trae Young | PG | 23.0 | 2022 | ... | 0.05 | 15.0 | 43.0 | 39.0 | 0.524 | ... |
13904 | Omer Yurtseven | C | 23.0 | 2022 | ... | 0.0 | 2.0 | 53.0 | 29.0 | 0.646 | ... |
After:
| | Player | Pos | Age | Year | ... | Share | Team Rank | W | L | W/L% | ... |
---|---|---|---|---|---|---|---|---|---|---|---|
13902 | Thaddeus Young | PF | 33 | 2022 | ... | 0.001 | 21 | 34 | 48 | 0.415 | ... |
13903 | Trae Young | PG | 23 | 2022 | ... | 0.05 | 15 | 43 | 39 | 0.524 | ... |
13904 | Omer Yurtseven | C | 23 | 2022 | ... | 0.0 | 2 | 53 | 29 | 0.646 | ... |
CodePudding user response:
It may be that the columns converted to int64 contain np.nan, which causes problems with the conversion. NaN has no integer representation, so casting it produces an arbitrary value. For example, in the conversion result below, nan becomes the smallest possible int64, a huge negative number.
np.array([1,2,3,np.nan]).astype("int64")
array([1, 2, 3,-9223372036854775808], dtype=int64)
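To see why a cast like this could produce repeated correlations, here is a minimal sketch on made-up data. The column names "A TOT" and "B TOT", the random values, and the rows chosen as missing are all hypothetical. Since the result of casting NaN to an integer can differ between platforms, the sketch plants the int64 minimum explicitly instead of relying on the cast itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
share = rng.random(n)

# Two made-up stat columns that track Share with different scale and noise.
df = pd.DataFrame({
    "Share": share,
    "A TOT": np.round(50 * share + rng.normal(0, 5, n)),
    "B TOT": np.round(900 * share + rng.normal(0, 200, n)),
})

# Assumption: both columns are missing in the same (hypothetical) rows.
df.loc[[3, 57, 120], ["A TOT", "B TOT"]] = np.nan

before = df.corr()["Share"]  # NaN rows are skipped pairwise

# Mimic the bad cast: every NaN becomes the smallest int64.
sentinel = np.iinfo(np.int64).min
broken = df.copy()
for col in ["A TOT", "B TOT"]:
    vals = df[col].to_numpy()
    broken[col] = np.where(np.isnan(vals), float(sentinel), vals).astype(np.int64)

after = broken.corr()["Share"]

print(before[["A TOT", "B TOT"]])  # two clearly different correlations
print(after[["A TOT", "B TOT"]])   # practically the same value twice
```

The planted values are so large that they swamp the real data, so each affected column's correlation with Share ends up depending almost entirely on which rows were missing. Columns that share the same missing rows would then report the same number, which matches the pattern in the question.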
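If you want to avoid this, you can use pandas' nullable integer type, pd.Int64Dtype (alias "Int64"), which stores missing values as pd.NA. Call astype on the Series itself rather than on .values, because a NumPy array cannot be cast to a pandas extension type:
df_final[col] = df_final[col].astype("Int64")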
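As a small illustration of the nullable integer approach, the sketch below converts a float column with one gap to "Int64" and compares the correlation before and after. The DataFrame contents are invented for the example; only the "Int64" dtype and the astype and corr calls come from pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: "W" holds whole numbers stored as floats, with one gap.
df = pd.DataFrame({
    "Share": [0.001, 0.050, 0.000, 0.210, 0.120],
    "W": [34.0, 43.0, np.nan, 53.0, 48.0],
})

float_corr = df.corr().loc["W", "Share"]

# Nullable integer dtype: the NaN becomes pd.NA instead of a garbage integer.
df["W"] = df["W"].astype("Int64")

int_corr = df.corr().loc["W", "Share"]

print(df["W"])               # integers displayed without ".0", gap shown as <NA>
print(float_corr, int_corr)  # the two correlations should agree
```

This keeps the readable integer display the question was after while leaving the missing value marked as missing, so it should not leak into the correlation.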
Maybe this article will be useful for you:
https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
CodePudding user response:
Changing the float-to-int conversion to df_final = df_final.astype({col:'int'}) seems to have fixed the issue. Now I'm getting unique correlation values for the int types.
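One difference worth knowing, as far as I understand pandas' behaviour: astype on a DataFrame or Series refuses to cast a column that still contains NaN to a plain integer type, whereas casting the underlying NumPy array does not. A rough sketch of checking for gaps before converting, using invented data and a hypothetical candidates list:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [33.0, 23.0, 23.0],
    "W": [34.0, np.nan, 53.0],  # hypothetical gap
})

# Only convert the columns that have no missing values.
candidates = ["Age", "W"]
safe = [c for c in candidates if not df[c].isna().any()]
df = df.astype({c: "int64" for c in safe})

# The column with a gap is refused rather than silently corrupted.
try:
    df["W"].astype("int64")
    raised = False
except ValueError as exc:
    raised = True
    print("refused:", exc)

print(df.dtypes)
```

Columns that fail this check could be left as float64 or converted to the nullable "Int64" type instead.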