Home > Back-end >  For a dataset of 0's and 1's, set all but the first 1 in each row to 0's
For a dataset of 0's and 1's, set all but the first 1 in each row to 0's

Time:08-11

I have a data.frame of 1,480 rows and 1,400 columns like:

     1  2  3  4  5  6 ..... 1399  1400
1    0  0  0  1  0  0 .....    1     0     #first occurrence would be at 4
2    0  0  0  0  0  1 .....    0     1
3    1  0  0  1  0  0 .....    0     0
## and etc

Each row contains a series of 0's and 1's - predominantly 0's. For each row, I want to find at which column the first 1 shows up and set the remaining values to 0's.

My current implementation can efficiently find the occurrence of the first 1, but I've only figured out how to zero out the remaining values iteratively by row. In repeated simulations, this iterative process is taking too long.

Here is the current implementation:

  N <- length(df[which(df$arm == 0), "pt_id"]) # of patients
  M <- max_days
  
  # 
  # df is like the data frame shown above
  #
  df[which(df$arm == 0), 5:length(colnames(df))] <- unlist(lapply(matrix(data = rep(pbo_hr, M*N), nrow=N, ncol = M), rbinom, n=1, size = 1))
  
  event_day_post_rand <- apply(df[,5:length(colnames(df))], MARGIN = 1, FUN = function(x) which (x>0)[1])
  df <- add_column(df, "event_day_post_rand" = event_day_post_rand, .after = "arm_id")
  
  ##
  ## From here trial days start on column 6 for df
  ##
  
  #zero out events that occurred after the first event, since each patient can only have 1 max event which will be taken as the earliest event
  for (pt_id in df[which(!is.na(df$event_day_post_rand)),"pt_id"]){
    event_idx = df[which(df$pt_id == pt_id), "event_day_post_rand"]
    df[which(df$pt_id == pt_id), as.character(5 event_idx 1):"1400"] <- 0
  }

CodePudding user response:

We can do

mat <- as.matrix(df)  ## data frame to matrix

j <- max.col(mat, ties.method = "first")
mat[] <- 0
mat[cbind(1:nrow(mat), j)] <- 1

df <- data.frame(mat)  ## matrix to data frame

I also suggest just using a matrix to store these values. In addition, the result will be a sparse matrix. So I recommend

library(Matrix)
sparseMatrix(i = 1:nrow(mat), j = j, x = rep(1, length(j)))

CodePudding user response:

We can get a little more performance by setting the 1 elements to 0 whose rows are duplicates.

Since the OP is open to starting with a matrix rather than a data.frame, I'll do the same.

# dummy data
m <- matrix(sample(0:1, 1480L*1400L, TRUE, c(0.9, 0.1)), 1480L, 1400L)

# proposed solution
f1 <- function(m) {
  ones <- which(m == 1L)
  m[ones[duplicated((ones - 1L) %% nrow(m), nmax = nrow(m))]] <- 0L
  m
}

# Zheyuan Li's solution
f2 <- function(m) {
  j <- max.col(m, ties.method = "first")
  m[] <- 0L
  m[cbind(1:nrow(m), j)] <- 1L
  m
}

microbenchmark::microbenchmark(f1 = f1(m),
                               f2 = f2(m),
                               check = "identical")
#> Unit: milliseconds
#>  expr     min       lq     mean  median      uq     max neval
#>    f1  9.1457 11.45020 12.04258 11.9011 12.3529 37.6716   100
#>    f2 12.8424 14.92955 17.31811 15.3251 16.0550 43.6314   100

Zheyuan Li's suggestion to go with a sparse matrix is a good idea.

# convert to a memory-efficient nsparseMatrix
library(Matrix)
m1 <- as(Matrix(f1(m), dimnames = list(NULL, NULL), sparse = TRUE), "nsparseMatrix")
object.size(m)
#> 8288216 bytes
object.size(m1)
#> 12864 bytes
# proposed function to go directly to a sparse matrix
f3 <- function(m) {
  n <- nrow(m)
  ones <- which(m == 1L) - 1L
  i <- ones %% n
  idx <- which(!duplicated(i, nmax = n))
  sparseMatrix(i[idx], ones[idx] %/% n, dims = dim(m), index1 = FALSE, repr = "C")
}
# going directly to a sparse matrix using Zheyuan Li's solution
f4 <- function(m) {
  sparseMatrix(1:nrow(m), max.col(m, ties.method = "first"), dims = dim(m), repr = "C")
}

identical(m1, f3(m))
#> [1] TRUE
identical(m1, f4(m))
#> [1] TRUE
microbenchmark::microbenchmark(f1 = f1(m),
                               f3 = f3(m),
                               f4 = f4(m))
#> Unit: milliseconds
#>  expr    min      lq     mean  median       uq     max neval
#>    f1 9.1719 9.30715 11.12569 9.52300 11.92740 83.8518   100
#>    f3 7.4330 7.59875 12.62412 7.69610 11.08815 84.8291   100
#>    f4 8.9607 9.31115 14.01477 9.49415 11.44825 87.1577   100
  • Related