I am trying to calculate the predicted r2 (using https://gist.github.com/benjaminmgross/d71f161d48378d34b6970fa6d7378837), but the values are completely off. Even the standard r2 is not correct.
Using the r2 metric, I get 0.8191.
Using Benjamin Gross's method, I get r2 = -19322.08 and pred_r2 = -35204.34.
Here are the script and the data:
import numpy as np

y_true = np.array(y_test)
xs = X_test

def press_statistic(y_true, y_pred, xs):
    res = y_pred - y_true
    hat = xs.dot(np.linalg.pinv(xs))
    den = (1 - np.diagonal(hat))
    sqr = np.square(res/den)
    return sqr.sum()

def predicted_r2(y_true, y_pred, xs):
    press = press_statistic(y_true=y_true,
                            y_pred=y_pred,
                            xs=xs
                            )
    sst = np.square( y_true - y_true.mean() ).sum()
    return 1 - press / sst

def r2(y_true, y_pred):
    sse = np.square( y_pred - y_true ).sum()
    sst = np.square( y_true - y_true.mean() ).sum()
    return 1 - sse/sst

print(r2(y_true, y_pred))
print(predicted_r2(y_true, y_pred, xs))
y_test
16698 -7.758248
16699 -8.007173
16700 -8.226193
16701 -8.459754
16702 -8.348888
27754 -55.125691
27755 -55.217113
27756 -55.295972
27757 -55.303383
27758 -55.442200
Name: logger, Length: 11061, dtype: float64
y_pred
array([[ -7.21622871],
[ -7.43596746],
[ -7.58752355],
...,
[-42.42983352],
[-42.38907826],
[-42.31012853]])
X_test
DayCos YearCos ... lagged_logger106 lagged_logger107
16698 -0.279829 0.836961 ... 2.294633 2.272826
16699 -0.021815 0.837353 ... 2.158491 2.294633
16700 0.237686 0.837744 ... 2.027501 2.158491
16701 0.480989 0.838135 ... 1.745879 2.027501
16702 0.691513 0.838526 ... 1.501611 1.745879
... ... ... ... ...
27754 -0.692563 0.486713 ... -44.717439 -44.702282
27755 -0.855665 0.486086 ... -44.918132 -44.717439
27756 -0.960456 0.485460 ... -45.132487 -44.918132
27757 -0.999793 0.484833 ... -45.458775 -45.132487
27758 -0.970995 0.484206 ... -45.672997 -45.458775
[11061 rows x 227 columns]
Can you see what I am doing wrong? Thanks! When I added y_pred to the question, I realised that the arrays are different. Maybe that is why?
CodePudding user response:
The r2 function works exactly like sklearn's r2_score function and returns the same results on the same NumPy arrays. From the snippet you provided, I can see how you obtain the true labels, but not how you obtain the predicted values. Can you provide a complete snippet that also contains an example input?
CodePudding user response:
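Usually, the hat matrix H is defined as:
$$H = X (X^T X)^{-1} X^T$$
where h_ii denotes the i-th diagonal element of H, and the PRESS statistic is given as:
$$\mathrm{PRESS} = \sum_{i=1}^{n} \left( \frac{e_i}{1 - h_{ii}} \right)^2$$
where e_i is the i-th residual and the sum runs over all n observations.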
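To make the formula concrete, here is a minimal sketch that checks the PRESS shortcut against a brute-force leave-one-out loop. The data are made up for illustration, and the check assumes an ordinary least squares fit on the same design matrix that is used to build the hat matrix; it is not the asker's model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 3
# Hypothetical design matrix with an explicit intercept column.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)

# Ordinary least squares fit on the full data.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Diagonal of the hat matrix H = X (X'X)^-1 X'.
h = np.diagonal(X @ np.linalg.inv(X.T @ X) @ X.T)
press_shortcut = np.sum((residuals / (1.0 - h)) ** 2)

# Brute force: refit without observation i, then predict observation i.
loo_errors = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    loo_errors[i] = y[i] - X[i] @ beta_i
press_loo = np.sum(loo_errors ** 2)

print(press_shortcut, press_loo)
```

The two numbers should agree up to floating-point error. The shortcut relies on the residuals coming from a least squares fit on X itself, so it is not expected to hold for residuals produced by a different model or a different data set.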
I recommend two things:
- Check whether the diagonal of hat = xs.dot(np.linalg.pinv(xs)) in your press_statistic() function is actually equal to h_ii.
- Use add_constant in your least squares model, or ensure that your linear regression model has an intercept, in order to make an accurate estimation of the total sum of squares (SST).
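Both checks can be sketched in a few lines. The feature matrix below is random and purely illustrative, and np.column_stack is used here as a stand-in for statsmodels' add_constant so that the example only needs NumPy.

```python
import numpy as np

rng = np.random.default_rng(1)
xs = rng.normal(size=(50, 4))  # made-up features, no intercept column

# Leverages as computed in press_statistic().
h_pinv = np.diagonal(xs.dot(np.linalg.pinv(xs)))
# Leverages from the textbook formula X (X'X)^-1 X'.
h_formula = np.diagonal(xs @ np.linalg.inv(xs.T @ xs) @ xs.T)
same_leverages = np.allclose(h_pinv, h_formula)
print(same_leverages)

# Prepend a column of ones so that the model has an intercept.
xs_const = np.column_stack([np.ones(len(xs)), xs])
h_const = np.diagonal(xs_const.dot(np.linalg.pinv(xs_const)))
# For a full-column-rank matrix the leverages sum to the number of columns.
print(h_const.sum(), xs_const.shape[1])
```

When xs has full column rank, the pseudo-inverse route and the explicit formula give the same leverages, so a mismatch there would point to a rank problem in the features rather than to the formula.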
CodePudding user response:
Thanks for your feedback! Looking more closely at y_pred made me see that I was comparing a 1-D array with an array of lists. I therefore had to flatten the predictions (y_pred_press = y_pred.reshape(1,-1)[0]) and it worked!
I don't know why I didn't catch that in the beginning, but thanks for pointing me to it.
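For anyone hitting the same problem, here is a small sketch of why the shapes matter. The numbers are invented; only the shapes mirror the question, where y_true is 1-D and y_pred is a column of single-element rows.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])          # shape (4,)
y_pred = np.array([[1.1], [1.9], [3.2], [3.8]])  # shape (4, 1)

def r2(y_true, y_pred):
    sse = np.square(y_pred - y_true).sum()
    sst = np.square(y_true - y_true.mean()).sum()
    return 1 - sse / sst

# (4, 1) - (4,) broadcasts to (4, 4): every prediction is
# compared with every true value, which inflates the error sum.
diff_shape = (y_pred - y_true).shape
print(diff_shape)

r2_wrong = r2(y_true, y_pred)          # uses the broadcast (4, 4) matrix
r2_right = r2(y_true, y_pred.ravel())  # flattened to shape (4,)
print(r2_wrong, r2_right)
```

y_pred.ravel() gives the same 1-D array as y_pred.reshape(1,-1)[0]. Checking y_true.shape and y_pred.shape before computing any metric is a cheap way to catch this kind of silent broadcasting.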