How to subset columns based on threshold values specified for a subset of rows-CodePudding

I want to subset columns of a big dataframe that adhere to this rule:

For each row (except row A) the value should be below 5.

Given the following example dataframe, I want the function to return a dataframe with only column c1 since all values in row B:E are below 5. I think the select_if function is probably the way to go, but I can't figure out how to exclude specific rows in that function.

Gene <- c("A", "B", "C", "D", "E")
c1 <- c(500, 1, 0, 3, 0)
c2 <- c(240, 235, 270, 100, 1)
c3 <- c(0, 3, 1000, 900, 2)
df1 <- data.frame(Gene, c1, c2, c3)
head(df1)

  Gene  c1  c2   c3
1    A 500 240    0
2    B   1 235    3
3    C   0 270 1000
4    D   3 100  900
5    E   0   1    2

CodePudding user response：

A tidyverse solution is

df1 %>% 
  select(
    df1 %>% 
      filter(row_number() > 1) %>% 
      summarise(across(starts_with("c"), max)) %>% 
      pivot_longer(everything()) %>% 
      filter(value < 5) %>% 
      pull(name)
  )
   c1
1 500
2   1
3   0
4   3
5   0

Explanation: the code inside the select calculates the minimum value for each column after ignoring the first row. The restult is then pivoted into long format, creating default columns name and value. This data frame is filtered to select only those columns where every value is less than five. The name column is then pulled and used as an argument to the outer select.

If you need other columns, just modify the select, for example,

df1 %>% 
  select(
    c("Gene", 
    df1 %>% 
      filter(row_number() > 1) %>% 
      summarise(across(starts_with("c"), max)) %>% 
      pivot_longer(everything()) %>% 
      filter(value < 5) %>% 
      pull(name)
  )
)

CodePudding user response：

A base R solution is simple to code.

i <- sapply(df1[-1], \(x) all(x[-1] < 5))
df1[c(TRUE, i)]
#>   Gene  c1
#> 1    A 500
#> 2    B   1
#> 3    C   0
#> 4    D   3
#> 5    E   0

^{Created on 2022-06-03 by the reprex package (v2.0.1)}

CodePudding user response：

To avoid reshaping or looping, use the colSums.

df1[c(1, which(colSums(df1[-1, -1] < 5) == 4)   1)]
#   Gene  c1
# 1    A 500
# 2    B   1
# 3    C   0
# 4    D   3
# 5    E   0