r2 is completely off when trying to calculate predicted r2 using the PRESS statistic

I am trying to calculate the predicted r2 (using https://gist.github.com/benjaminmgross/d71f161d48378d34b6970fa6d7378837), but the values are completely off. Even the standard r2 is not correct.

Using the metric r2, I get 0.8191

Using the functions from Benjamin Gross's gist, I get: r2 = -19322.08 and pred_r2 = -35204.34

Here are the script and the data:

import numpy as np

y_true = np.array(y_test)
xs = X_test

def press_statistic(y_true, y_pred, xs):
    res = y_pred - y_true
    hat = xs.dot(np.linalg.pinv(xs))
    den = (1 - np.diagonal(hat))
    sqr = np.square(res/den)
    return sqr.sum()

def predicted_r2(y_true, y_pred, xs):
    press = press_statistic(y_true=y_true,
                            y_pred=y_pred,
                            xs=xs
    )

    sst  = np.square( y_true - y_true.mean() ).sum()
    return 1 - press / sst
 
def r2(y_true, y_pred):
    sse  = np.square( y_pred - y_true ).sum()
    sst  = np.square( y_true - y_true.mean() ).sum()
    return 1 - sse/sst

print(r2(y_true, y_pred))
print(predicted_r2(y_true, y_pred, xs))

y_test

16698    -7.758248
16699    -8.007173
16700    -8.226193
16701    -8.459754
16702    -8.348888
   
27754   -55.125691
27755   -55.217113
27756   -55.295972
27757   -55.303383
27758   -55.442200
Name: logger, Length: 11061, dtype: float64

y_pred

array([[ -7.21622871],
       [ -7.43596746],
       [ -7.58752355],
       ...,
       [-42.42983352],
       [-42.38907826],
       [-42.31012853]])

X_test

         DayCos   YearCos  ...  lagged_logger106  lagged_logger107
16698 -0.279829  0.836961  ...          2.294633          2.272826
16699 -0.021815  0.837353  ...          2.158491          2.294633
16700  0.237686  0.837744  ...          2.027501          2.158491
16701  0.480989  0.838135  ...          1.745879          2.027501
16702  0.691513  0.838526  ...          1.501611          1.745879
        ...       ...  ...               ...               ...
27754 -0.692563  0.486713  ...        -44.717439        -44.702282
27755 -0.855665  0.486086  ...        -44.918132        -44.717439
27756 -0.960456  0.485460  ...        -45.132487        -44.918132
27757 -0.999793  0.484833  ...        -45.458775        -45.132487
27758 -0.970995  0.484206  ...        -45.672997        -45.458775

[11061 rows x 227 columns]

Can you see what I am doing wrong? Thanks! While adding y_pred to this post, I realised that the two arrays look different. Maybe that is the reason?

CodePudding user response：

The r2 function behaves exactly like sklearn's r2_score function and returns the same results on the same NumPy arrays. From the snippet you provided, I can see how you obtain the true labels, but not how you obtain the predicted values. Can you provide a complete snippet that also includes an example input?

CodePudding user response：

Usually, the hat matrix H is defined as:

H = X (X'X)^(-1) X'

where h_ii denotes its i-th diagonal element, and the PRESS statistic is given as:

PRESS = sum_{i=1}^{n} (e_i / (1 - h_ii))^2, where e_i is the i-th residual and the sum runs over all n observations

I recommend two things:

1. Check whether the diagonal of hat = xs.dot(np.linalg.pinv(xs)) in your press_statistic() function is actually equal to h_ii.
2. Use add_constant in your least-squares model, or otherwise make sure your linear regression model has an intercept, so that the total sum of squares (SST) is estimated accurately.
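
The first check can be tried on synthetic data. The sketch below is only an illustration, not the asker's model: it assumes an ordinary least-squares fit with an explicit intercept column, and the data and the names X, y and beta are made up. It compares the diagonal of X @ pinv(X) with the diagonal of X (X'X)^(-1) X', and then checks the PRESS shortcut against a brute-force leave-one-out loop.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3

# Hypothetical design matrix with an explicit intercept column.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)

# Leverages h_ii from the pseudo-inverse and from the textbook formula.
h_pinv = np.diagonal(X @ np.linalg.pinv(X))
h_text = np.diagonal(X @ np.linalg.inv(X.T @ X) @ X.T)

# OLS fit on the same X that defines the hat matrix.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta
press_shortcut = np.sum((residuals / (1.0 - h_pinv)) ** 2)

# Brute force: refit without row i, then predict row i.
press_loo = 0.0
for i in range(n):
    keep = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    press_loo += (y[i] - X[i] @ beta_i) ** 2

sst = np.sum((y - y.mean()) ** 2)
predicted_r2 = 1.0 - press_shortcut / sst

print(np.allclose(h_pinv, h_text))            # do both leverage formulas agree?
print(np.isclose(press_shortcut, press_loo))  # does the shortcut match leave-one-out?
print(predicted_r2)
```

The shortcut e_i / (1 - h_ii) is derived for OLS residuals computed on the same rows the model was fitted on. If y_pred comes from a different model, or xs is a held-out test set as in the question, the quantity may not correspond to a leave-one-out error.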

CodePudding user response：

Thanks for your feedback! Looking more closely at y_pred made me see that I was comparing a flat array with an array of lists. I therefore had to reshape the prediction array (y_pred_press = y_pred.reshape(1,-1)[0]) and it worked! I don't know why I didn't catch that in the beginning, but thanks for making me look.
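
The shape mismatch can be reproduced with a few made-up numbers. This is a minimal sketch with hypothetical values, not the original data: subtracting a shape (n,) array from a shape (n, 1) array broadcasts to an (n, n) matrix, so the squared-error sum is inflated and r2 drops far below the correct value. Flattening y_pred first avoids this.

```python
import numpy as np

def r2(y_true, y_pred):
    sse = np.square(y_pred - y_true).sum()
    sst = np.square(y_true - y_true.mean()).sum()
    return 1 - sse / sst

# Hypothetical values: y_true is flat, y_pred_2d is a column vector
# (the shape many model predict() calls return).
y_true = np.array([-7.8, -8.0, -8.2, -8.5, -8.3])
y_pred_2d = np.array([[-7.2], [-7.4], [-7.6], [-8.4], [-8.1]])

# Broadcasting turns the difference into a full 5 x 5 matrix.
print((y_pred_2d - y_true).shape)  # (5, 5)

r2_wrong = r2(y_true, y_pred_2d)

# Flatten to shape (5,); equivalent to y_pred_2d.reshape(1, -1)[0].
y_pred_1d = y_pred_2d.ravel()
r2_right = r2(y_true, y_pred_1d)

print(r2_wrong, r2_right)
```

Asserting y_true.shape == y_pred.shape at the top of r2 and press_statistic would turn this silent broadcast into an immediate error.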