Applying function across all cells of data frame and return indices (R)

Time:01-25

I have a large data frame in which I need to check whether a value in, say, row 1 column 2 is within 25 percent of row 1 column 1, and then repeat that check for each column and each row.

Edit: Not all cells are compared to (1,1). Each cell is compared to the one before it in the same row, i.e. (1,2) is compared to (1,1), (1,3) is compared to (1,2), (2,2) is compared to (2,1), and (2,3) is compared to (2,2).

Quick example:

     1     2     3
1    40    50    90
2    25    60    43

In this case I would need to return something like (1,3), (2,2), (2,3).

Here's what I coded, but it's incredibly slow for large data frames (as I expected). While I know how to speed this up in C, C++, Python, etc., I am newer to R and not sure what to do.

off = data.frame(matrix(ncol=2,nrow=0))
colnames(off) = c("Row", "Col")

for (row in 1:nrow(data)) {
    for (col in 2:ncol(data)) {
        orig = data[[row, col]]
        comp = data[[row, col-1]]
        if ((orig > comp & orig > 1.25*comp) | 
                (orig < comp & orig < 0.75*comp)) {
                
            off[nrow(off) + 1,] = c(row, col)
        }
    }
}

Thank you for any help in advance and please ask any clarifying questions.

Answer:

Let's do this column-wise (no for loops required):

mtx <- structure(c(40L, 25L, 50L, 60L, 90L, 43L), dim = 2:3)
which(cbind(FALSE,
  mapply(function(a, b) abs(mtx[,b] / mtx[,a] - 1) <= 0.25,
         1:(ncol(mtx)-1), 2:ncol(mtx))),
  arr.ind = TRUE)
#      row col
# [1,]   1   2
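
Note that the call above reports the cells that are within 25 percent of their left neighbour, while the question asks for the ones that are not. A minimal sketch of the inverted check follows, reusing the same mtx. The names in_tol and off are introduced here purely for illustration and are not part of the original answer.

```r
mtx <- structure(c(40L, 25L, 50L, 60L, 90L, 43L), dim = 2:3)

# TRUE where a cell is within 25% of the cell to its left
# (one column per consecutive pair of columns)
in_tol <- mapply(function(a, b) abs(mtx[, b] / mtx[, a] - 1) <= 0.25,
                 1:(ncol(mtx) - 1), 2:ncol(mtx))

# Negate to keep the out-of-tolerance cells; the FALSE column stands in
# for column 1, which has no left neighbour to compare against
off <- which(cbind(FALSE, !in_tol), arr.ind = TRUE)
off
#      row col
# [1,]   2   2
# [2,]   1   3
# [3,]   2   3
```

The rows come back in column-major order, so (2,2) is listed before (1,3). For a data frame, converting it first with as.matrix(data) should work, assuming every column is numeric.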

Breakdown:

  • mapply(...) iterates the function over two vectors/lists in parallel. In this case, we iterate over 1:(ncol(mtx)-1) paired with 2:ncol(mtx), so the anonymous function is called with (1,2), (2,3) (and more if the matrix had more columns).

  • In the internal anon-function, mtx[,b] / mtx[,a] computes the ratio for a whole column at a time, so in the first call it's mtx[,2] / mtx[,1]. Since this is a ratio, we can reduce to %-change by subtracting 1. Since we need to find those with 25% or less change, we end up with abs(mtx[,b] / mtx[,a] - 1) <= 0.25.

    That step is for each pair of consecutive columns.

  • The which(..., arr.ind=TRUE) returns a two-column matrix with column names row and col, indicating where in the provided matrix the TRUE cells are found.

  • The mapply(...) call is reductive in that it returns only ncol(mtx)-1 columns, so the col values reported by arr.ind=TRUE would be off by one. We can either add 1 to col afterwards or prepend a column of FALSE to the matrix returned by mapply; I opted for the latter.
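
The other option mentioned in the last bullet, adding 1 to col afterwards instead of padding with a FALSE column, could look like the sketch below. The names in_tol and idx are illustrative only.

```r
mtx <- structure(c(40L, 25L, 50L, 60L, 90L, 43L), dim = 2:3)

# One result column per consecutive pair of columns: ncol(mtx) - 1 in total
in_tol <- mapply(function(a, b) abs(mtx[, b] / mtx[, a] - 1) <= 0.25,
                 1:(ncol(mtx) - 1), 2:ncol(mtx))
dim(in_tol)
# [1] 2 2

# Indices are relative to in_tol, so column k here means column k + 1 of mtx
idx <- which(in_tol, arr.ind = TRUE)
idx[, "col"] <- idx[, "col"] + 1L
idx
#      row col
# [1,]   1   2
```

Both approaches give the same indices. Padding keeps everything in a single expression, while shifting avoids building the extra column.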
