Randomly remove some numeric data from a matrix in R?

I have a large data matrix with many numeric values (counts) in it. I would like to remove 10% of all counts. So, for example, a matrix which looks like this:

30 10
 0 20

The sum of all counts here is 60. 10% of 60 is 6. So I want to randomly remove 6. A correct output could be:

29 6
 0 19

(As you can see, it removed 1 from 30, 4 from 10 and 1 from 20.) There cannot be negative values.

How could I program this in R?

CodePudding user response：

Maybe this helps you at least to get on the right track. It's nothing more than a draft though:

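randomlyRemove <- function(mat, fraction = 0.1) {
  to_remove <- round(sum(mat) * fraction)  # total count to take out
  while (to_remove > 0) {
    # Pick a random cell. Note: with more than two rows/columns, round(runif())
    # picks the first and last ones half as often as the rest;
    # sample.int() would be uniform.
    x <- round(runif(1, 1, dim(mat)[1]), digits = 0)
    y <- round(runif(1, 1, dim(mat)[2]), digits = 0)

    # Only subtract from positive cells so no value can become negative
    if (mat[x, y] > 0) {
      mat[x, y] <- mat[x, y] - 1
      to_remove <- to_remove - 1
    }
  }
  return(mat)
}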

You might want to experiment with how the random numbers are generated to get more evenly distributed subtractions.

edit: added round(digits = 0) to get only integer (dimension) values and modified the random (dimension) value generation to start from 1 (not zero).
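
To see what "more evenly distributed" could mean here, the following is a small sketch comparing the round(runif()) index draw with sample.int(). The number of rows (5) and the number of draws are made up for illustration and are not from the answer above.

```r
# Assumption for illustration: 5 rows, 100000 draws (neither is from the answer)
set.seed(42)
n_rows <- 5
idx_round  <- round(runif(1e5, 1, n_rows))            # approach used in the draft
idx_sample <- sample.int(n_rows, 1e5, replace = TRUE) # uniform alternative

# Share of draws per index: the first and last index come out at roughly
# 1/8 each with round(runif()), versus roughly 1/5 for every index with sample.int()
prop.table(table(idx_round))
prop.table(table(idx_sample))
```

The edge indices only receive half a unit of the interval [1, n] when rounding, which is why they are picked less often.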

CodePudding user response：

Here is a way. It subtracts 1 from randomly chosen positive matrix elements until the requested total has been removed.

subtract_int <- function(X, n){
  inx <- which(X != 0, arr.ind = TRUE)
  N <- nrow(inx)
  while(n > 0){
    i <- sample(N, 1)
    if(X[ inx[i, , drop = FALSE] ] > 0){
      X[ inx[i, , drop = FALSE] ] <- X[ inx[i, , drop = FALSE] ] - 1
      n <- n - 1
    }
    if(any(X[inx] == 0)){
      inx <- which(X != 0, arr.ind = TRUE)
      N <- nrow(inx)
    }
  }
  X
}

set.seed(2021)
to_remove <- round(sum(A)*0.10)
subtract_int(A, to_remove)
#     [,1] [,2]
#[1,]   30    6
#[2,]    0   18

Data

A <- structure(c(30, 0, 10, 20), .Dim = c(2L, 2L))

CodePudding user response：

I think we can make it work using sample. This solution is a lot more compact.

The data

A <- structure(c(30, 0, 11, 20), .Dim = c(2L, 2L))
sum(A)
#> [1] 61

The logic

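UseThese <- which(A > 0) # Indices of the positive cells, the only ones to be modified
# Draw indices with replacement; sample.int() avoids sample()'s special case for a single index
Sample <- UseThese[sample.int(length(UseThese), round(sum(A) * 0.1), replace = TRUE)]
# Count how often each index was drawn. The factor levels keep indices that were
# never drawn (count 0), so the counts line up with UseThese.
A[UseThese] <- A[UseThese] - as.vector(table(factor(Sample, levels = UseThese)))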
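
A note on counting the draws: table() applied to the raw sample only lists indices that were actually drawn, so its result can be shorter than the vector of candidate indices and the subtraction can end up misaligned. A small sketch with made-up values (not the matrix above) shows the difference when the candidate indices are passed as factor levels:

```r
# Made-up values for illustration only
UseThese <- c(1, 3, 4)   # candidate indices
Sample   <- c(3, 3, 4)   # index 1 was never drawn

counts_raw   <- as.vector(table(Sample))                            # 2 1  (only two entries)
counts_fixed <- as.vector(table(factor(Sample, levels = UseThese))) # 0 2 1 (one entry per candidate)

counts_raw
counts_fixed
```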

Check the result

A
#>      [,1] [,2]
#> [1,]   28    8
#> [2,]    0   19
sum(A) # should be the value above minus 6
#> [1] 55

One disadvantage of this solution is that it could lead to negative values, because a cell can be drawn more often than its count. So check with:

any(A < 0)
#> [1] FALSE
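
If negative values must be ruled out entirely, one possible variation is to treat every single count as its own unit and draw units without replacement, so a cell can never lose more than it holds. This is only a sketch: remove_fraction is a hypothetical helper name, and it assumes the matrix contains non-negative whole numbers.

```r
# Hypothetical helper (not from the answers above). Assumes A holds
# non-negative integer counts.
remove_fraction <- function(A, fraction = 0.1) {
  n <- round(sum(A) * fraction)                      # number of units to remove
  units <- rep(seq_along(A), times = as.vector(A))   # one entry per count, labelled by cell index
  drawn <- units[sample.int(length(units), n)]       # draw units without replacement
  A - tabulate(drawn, nbins = length(A))             # subtract draws per cell; A keeps its dimensions
}

set.seed(1)
A <- matrix(c(30, 0, 10, 20), nrow = 2)
B <- remove_fraction(A)
sum(A) - sum(B)  # 6
any(B < 0)       # FALSE
```

With this approach cells with larger counts are proportionally more likely to be reduced, which may or may not be what you want.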