Reduce number of columns with priority for certain values-CodePudding

I would like to collapse a data frame with < 100 columns fourfold, whereby the code would iterate over groups of 4 adjacent columns and collapse them into one. However, the resulting values based on each set of 4, depend on the priority of the value.

For example, the highest priority is '1', so whenever any of the 4 columns has a value '1' for that row it should be the resulting value. The second priority is 0, if the set has one '0' and three NA's, the result should be '0' (as long as there's no '1's). The lowest priority is NA, only sets consisting of NA completely would remain NA. An example below, with reproducible code underneath.

  ID c1 c2 c3 c4 c5 c6 c7 c8
row1  1  0  0  0  1  0  0 NA
row2 NA NA NA  0 NA NA NA NA

becomes

  ID c1 c2
row1  1  1
row2  0 NA

structure(list(ID = c("row1", "row2"), c1 = c(1, NA), c2 = c(0, 
NA), c3 = c(0, NA), c4 = c(0, 0), c5 = c(1, NA), c6 = c(0, NA
), c7 = c(0, NA), c8 = c(NA, NA)), class = "data.frame", row.names = c(NA, 
-2L))

CodePudding user response：

How about this:

dat <- structure(list(ID = c("row1", "row2"), c1 = c(1, NA), c2 = c(0, 
                                                             NA), c3 = c(0, NA), c4 = c(0, 0), c5 = c(1, NA), c6 = c(0, NA
                                                             ), c7 = c(0, NA), c8 = c(NA, NA)), class = "data.frame", row.names = c(NA, 
                                                                                                                                    -2L))

out <- data.frame(ID = dat$ID)
k <- 2 # first column to start
i <- 1 # first variable name
while(k < ncol(dat)){
  out[[paste0("c", i)]] <- apply(dat[,k:(k 3)], 1, max, na.rm=TRUE)
  out[[paste0("c", i)]] <- ifelse(is.finite(out[[paste0("c", i)]]), out[[paste0("c", i)]], NA)
  k <- k 4
  i <- i 1
}
#> Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
out
#>     ID c1 c2
#> 1 row1  1  1
#> 2 row2  0 NA

^{Created on 2022-11-21 by the reprex package (v2.0.1)}

CodePudding user response：

Here is an alternative approach:


f <- function(x) fifelse(all(is.na(x)), NA_real_,1*(sum(x,na.rm = T)>0))

dcast(
  melt(setDT(df),"ID",variable.name = "v")[
    ,f(value), .(ID,r=paste0("c",rep(1:(uniqueN(v)/4), each=uniqueN(v))))],
  ID~r, value.var = "V1",
)

Output:

     ID c1 c2
1: row1  1  1
2: row2  0 NA

CodePudding user response：

Using split.default to split dataframe every 4th column, then use pmax:

x <- df1
x[ is.na(x) ] <- -1
res <- cbind(df1[ "ID" ],
               lapply(split.default(x[, -1], rep(1:2, each = 4)), 
                      function(i) do.call(pmax, i)))

res[ res == -1 ] <- NA
res
#     ID 1  2
# 1 row1 1  1
# 2 row2 0 NA