Shaken faith in `qr()`-CodePudding

I have relied on the qr() function a lot in dealing with rank-deficient situations, but have recently run into some examples where it doesn't work correctly. Consider the matrix badX below:

badX <-
structure(c(-1.641906809157e-10, 0, 0, 0, 0, -0.5, 0, 0, -1.10482935525559e-16, 
            0, -3.06266685765538e-17, 0, -4.83736007092039e-17, 0, -3.14414492582296e-18, 
            -3.06158275836099e-18), dim = c(4L, 4L), dimnames = list(c("(Intercept)", 
            "A2", "A3", "B2"), NULL))

We cannot invert this matrix using the solve():

solve(badX)
## Error in solve.default(badX): system is computationally singular: reciprocal condition number = 5.55308e-18

Yet qr() and its associated routines thinks this matrix has a rank of 4 and it can invert it:

qr(badX)$rank
## [1] 4

qr.solve(badX)
##             [,1] [,2]          [,3]          [,4]
## [1,] -6090479645    0  2.197085e 10  7.366741e 10
## [2,]           0   -2  0.000000e 00  0.000000e 00
## [3,]           0    0 -3.265128e 16  3.353179e 16
## [4,]           0    0  0.000000e 00 -3.266284e 17

This is a pretty ugly result. I have tried varying the tol argument, with no change in results.

For context, the origin of this result is this contrast matrix:

badL <-
structure(c(0, 0, 0, 0, 0, -9.89189274870351e-11, 0, -5.55111512312578e-17, 
    -2.77555756156289e-17, 1.11022302462516e-16, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.25, 0, 0, 0, 0, -0.25, 0, 0, 
    0, 9.89189274870351e-11, 0, 5.55111512312578e-17, 2.77555756156289e-17, 
    -1.11022302462516e-16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, -4.23939184015843e-11, 0, -4.16333634234434e-17, -1.38777878078145e-17, 
    5.55111512312578e-17, 0, 0, 0, 0, 0, -4.23939184015843e-11, 0, 
    -4.16333634234434e-17, -1.38777878078145e-17, 5.55111512312578e-17, 
    0, 0, 0, 0, 0, 0, 0.25, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.25, 0, 0, 
    0, 0, 0, 0, 0, 0, 4.23939184015843e-11, 0, 4.16333634234434e-17, 
    1.38777878078145e-17, -5.55111512312578e-17, 0, 0, 0, 0, 0, -1.41313127284714e-11, 
    0, -6.93889390390723e-18, -6.93889390390723e-18, 1.38777878078145e-17, 
    4.23939184015843e-11, 0, 4.16333634234434e-17, 1.38777878078145e-17, 
    -5.55111512312578e-17, 0, 0, 0, 0, 0), dim = c(5L, 24L), dimnames = list(
    NULL, c("(Intercept)", "A2", "A3", "B2", "B3", "C2", "C3", 
    "A2:B2", "A3:B2", "A2:B3", "A3:B3", "A2:C2", "A3:C2", "A2:C3", 
    "A3:C3", "B2:C2", "B3:C2", "B2:C3", "B3:C3", "A2:B2:C2", 
    "A3:B2:C2", "A3:B3:C2", "A2:B2:C3", "A3:B2:C3")))

... from which I obtained the QR decomposition of its transpose, to find that it is supposedly of rank 4:

badQR <- qr(t(badL))
badQR$rank
## [1] 4

The above matrix badX is equal to qr.R(badQR)[1:4, 1:4] which based on the rank calculation, was supposed to be a full-rank upper-triangular matrix.

My remedy seems to be to use zapsmall() so that I get the rank right...

qr(zapsmall(t(badL)))$rank
## [1] 1

My question is, why does this happen? If you look at badL, it's pretty clear that it has three zero rows and only the second row is nonzero. I would have thought that qr()'s pivoting methods would work better with this. Is there a better way to get more reliable code?

I am running Windows 11 Pro, version 10.0.22000 build 22000. Here's my R system information.

R.Version()
## $platform
## [1] "x86_64-w64-mingw32"
## 
## $arch
## [1] "x86_64"
## 
## $os
## [1] "mingw32"
## 
## $crt
## [1] "ucrt"
## 
## $system
## [1] "x86_64, mingw32"
## 
## $status
## [1] ""
## 
## $major
## [1] "4"
## 
## $minor
## [1] "2.0"
## 
## $year
## [1] "2022"
## 
## $month
## [1] "04"
## 
## $day
## [1] "22"
## 
## $`svn rev`
## [1] "82229"
## 
## $language
## [1] "R"
## 
## $version.string
## [1] "R version 4.2.0 (2022-04-22 ucrt)"
## 
## $nickname
## [1] "Vigorous Calisthenics"

^{Created on 2022-06-21 by the reprex package (v2.0.1)}

CodePudding user response：

You are complaining that solve can not invert a matrix that seems to be full rank (according to qr). And you think that solve is doing the correct thing, while qr is not.

Well, do not trust solve. It is not a robust numerical procedure and we can fool it easily. Here is a diagonal matrix. It is certainly invertible (by simply inverting its diagonal elements), but solve just can't do it.

D <- diag(c(1, 1e-20))
#     [,1]  [,2]
#[1,]    1 0e 00
#[2,]    0 1e-20

solve(D)
#Error in solve.default(D) : 
#  system is computationally singular: reciprocal condition number = 1e-20

Dinv <- diag(c(1, 1e 20))

## an identity matrix, as expected
D %*% Dinv
#     [,1] [,2]
#[1,]    1    0
#[2,]    0    1

## an identity matrix, as expected
Dinv %*% D
#     [,1] [,2]
#[1,]    1    0
#[2,]    0    1

Now let's look at your badX, which I call R (as it is the upper triangular matrix returned by QR factorization).

R <-
structure(c(-1.641906809157e-10, 0, 0, 0, 0, -0.5, 0, 0, -1.10482935525559e-16, 
            0, -3.06266685765538e-17, 0, -4.83736007092039e-17, 0, -3.14414492582296e-18, 
            -3.06158275836099e-18), dim = c(4L, 4L))

solve can not invert it, but qr.solve gives you a proper inverse matrix.

Rinv <- qr.solve(R)

## an identity matrix, as expected
R %*% Rinv
#     [,1] [,2] [,3]         [,4]
#[1,]    1    0    0 1.776357e-15
#[2,]    0    1    0 0.000000e 00
#[3,]    0    0    1 0.000000e 00
#[4,]    0    0    0 1.000000e 00

## an identity matrix, as expected
Rinv %*% R
#     [,1] [,2] [,3]         [,4]
#[1,]    1    0    0 5.293956e-23
#[2,]    0    1    0 0.000000e 00
#[3,]    0    0    1 1.387779e-17
#[4,]    0    0    0 1.000000e 00

QR factorization is numerically stable, by being less sensitive to the scale (or size, magnitude) of different columns. (Other factorizations like LU (on which solve is based) and SVD do.) By definition, this factorization does

X = Q R

If we re-scale X's columns by right multiplying a full-rank diagonal matrix D, the QR factorization does not change.

X D = Q R D

So let's look at your big matrix t(badL) to which you apply the QR factorization. I call it X.

X <- structure(c(0, -9.89189274870351e-11, 0, 0, 0, 0, 0, 9.89189274870351e-11, 
0, 0, 0, -4.23939184015843e-11, 0, -4.23939184015843e-11, 0, 
0, 0, 0, 0, 4.23939184015843e-11, 0, -1.41313127284714e-11, 4.23939184015843e-11, 
0, 0, 0, 0, 0, 0, -0.25, -0.25, 0, 0, 0, 0, 0, 0, 0, 0, 0.25, 
0, 0.25, 0, 0, 0, 0, 0, 0, 0, -5.55111512312578e-17, 0, 0, 0, 
0, 0, 5.55111512312578e-17, 0, 0, 0, -4.16333634234434e-17, 0, 
-4.16333634234434e-17, 0, 0, 0, 0, 0, 4.16333634234434e-17, 0, 
-6.93889390390723e-18, 4.16333634234434e-17, 0, 0, -2.77555756156289e-17, 
0, 0, 0, 0, 0, 2.77555756156289e-17, 0, 0, 0, -1.38777878078145e-17, 
0, -1.38777878078145e-17, 0, 0, 0, 0, 0, 1.38777878078145e-17, 
0, -6.93889390390723e-18, 1.38777878078145e-17, 0, 0, 1.11022302462516e-16, 
0, 0, 0, 0, 0, -1.11022302462516e-16, 0, 0, 0, 5.55111512312578e-17, 
0, 5.55111512312578e-17, 0, 0, 0, 0, 0, -5.55111512312578e-17, 
0, 1.38777878078145e-17, -5.55111512312578e-17, 0), dim = c(24L, 
5L))

#               [,1]  [,2]          [,3]          [,4]          [,5]
# [1,]  0.000000e 00  0.00  0.000000e 00  0.000000e 00  0.000000e 00
# [2,] -9.891893e-11  0.00 -5.551115e-17 -2.775558e-17  1.110223e-16
# [3,]  0.000000e 00  0.00  0.000000e 00  0.000000e 00  0.000000e 00
# [4,]  0.000000e 00  0.00  0.000000e 00  0.000000e 00  0.000000e 00
# [5,]  0.000000e 00  0.00  0.000000e 00  0.000000e 00  0.000000e 00
# [6,]  0.000000e 00 -0.25  0.000000e 00  0.000000e 00  0.000000e 00
# [7,]  0.000000e 00 -0.25  0.000000e 00  0.000000e 00  0.000000e 00
# [8,]  9.891893e-11  0.00  5.551115e-17  2.775558e-17 -1.110223e-16
# [9,]  0.000000e 00  0.00  0.000000e 00  0.000000e 00  0.000000e 00
#[10,]  0.000000e 00  0.00  0.000000e 00  0.000000e 00  0.000000e 00
#[11,]  0.000000e 00  0.00  0.000000e 00  0.000000e 00  0.000000e 00
#[12,] -4.239392e-11  0.00 -4.163336e-17 -1.387779e-17  5.551115e-17
#[13,]  0.000000e 00  0.00  0.000000e 00  0.000000e 00  0.000000e 00
#[14,] -4.239392e-11  0.00 -4.163336e-17 -1.387779e-17  5.551115e-17
#[15,]  0.000000e 00  0.00  0.000000e 00  0.000000e 00  0.000000e 00
#[16,]  0.000000e 00  0.25  0.000000e 00  0.000000e 00  0.000000e 00
#[17,]  0.000000e 00  0.00  0.000000e 00  0.000000e 00  0.000000e 00
#[18,]  0.000000e 00  0.25  0.000000e 00  0.000000e 00  0.000000e 00
#[19,]  0.000000e 00  0.00  0.000000e 00  0.000000e 00  0.000000e 00
#[20,]  4.239392e-11  0.00  4.163336e-17  1.387779e-17 -5.551115e-17
#[21,]  0.000000e 00  0.00  0.000000e 00  0.000000e 00  0.000000e 00
#[22,] -1.413131e-11  0.00 -6.938894e-18 -6.938894e-18  1.387779e-17
#[23,]  4.239392e-11  0.00  4.163336e-17  1.387779e-17 -5.551115e-17
#[24,]  0.000000e 00  0.00  0.000000e 00  0.000000e 00  0.000000e 00

Let's re-scale its columns so that every column has Euclidean norm (L2 norm, 2-norm) 1.

norm2 <- sqrt(colSums(X ^ 2))

XD <- X * rep(1 / norm2, each = nrow(X))

#             [,1] [,2]        [,3]       [,4]        [,5]
# [1,]  0.00000000  0.0  0.00000000  0.0000000  0.00000000
# [2,] -0.60246371  0.0 -0.48418203 -0.5714286  0.57585260
# [3,]  0.00000000  0.0  0.00000000  0.0000000  0.00000000
# [4,]  0.00000000  0.0  0.00000000  0.0000000  0.00000000
# [5,]  0.00000000  0.0  0.00000000  0.0000000  0.00000000
# [6,]  0.00000000 -0.5  0.00000000  0.0000000  0.00000000
# [7,]  0.00000000 -0.5  0.00000000  0.0000000  0.00000000
# [8,]  0.60246371  0.0  0.48418203  0.5714286 -0.57585260
# [9,]  0.00000000  0.0  0.00000000  0.0000000  0.00000000
#[10,]  0.00000000  0.0  0.00000000  0.0000000  0.00000000
#[11,]  0.00000000  0.0  0.00000000  0.0000000  0.00000000
#[12,] -0.25819930  0.0 -0.36313652 -0.2857143  0.28792630
#[13,]  0.00000000  0.0  0.00000000  0.0000000  0.00000000
#[14,] -0.25819930  0.0 -0.36313652 -0.2857143  0.28792630
#[15,]  0.00000000  0.0  0.00000000  0.0000000  0.00000000
#[16,]  0.00000000  0.5  0.00000000  0.0000000  0.00000000
#[17,]  0.00000000  0.0  0.00000000  0.0000000  0.00000000
#[18,]  0.00000000  0.5  0.00000000  0.0000000  0.00000000
#[19,]  0.00000000  0.0  0.00000000  0.0000000  0.00000000
#[20,]  0.25819930  0.0  0.36313652  0.2857143 -0.28792630
#[21,]  0.00000000  0.0  0.00000000  0.0000000  0.00000000
#[22,] -0.08606647  0.0 -0.06052275 -0.1428571  0.07198158
#[23,]  0.25819930  0.0  0.36313652  0.2857143 -0.28792630
#[24,]  0.00000000  0.0  0.00000000  0.0000000  0.00000000

What do you think now? Is it still a matrix with only one nonzero column? Although qr(X) does not actually first re-scale all columns before QR factorization, looking at XD does help you better understand why QR factorization is more robust.

If you do want to intervene, do not use zapsmall; threshold columns by their 2-norm, instead.

X0 <- X
X0[, norm2 < max(norm2) * sqrt(.Machine$double.eps)] <- 0
QR0 <- qr(X0)

QR0$rank
# [1] 1

How do we know that sqrt(.Machine$double.eps) is an appropriate threshold?

Any threshold between sqrt(.Machine$double.eps) (about 1e-8) and .Machine$double.eps (about 1e-16) is reasonable. Using .Machine$double.eps recovers the regular QR result, giving you rank 4.

The "sqrt" threshold comes from the situation where we want to look at X'X, which squares the condition number of X.

CodePudding user response：

I would suggest that you prefer Singular Value Decomposition. It will give you the best solution possible if the matrix is rank deficient. Here's an example of how to use it in R.