I want to subset columns of a big dataframe that adhere to this rule:
For each row (except row A) the value should be below 5.
Given the following example dataframe, I want the function to return a dataframe with only column c1 since all values in row B:E are below 5. I think the select_if function is probably the way to go, but I can't figure out how to exclude specific rows in that function.
Gene <- c("A", "B", "C", "D", "E")
c1 <- c(500, 1, 0, 3, 0)
c2 <- c(240, 235, 270, 100, 1)
c3 <- c(0, 3, 1000, 900, 2)
df1 <- data.frame(Gene, c1, c2, c3)
head(df1)
Gene c1 c2 c3
1 A 500 240 0
2 B 1 235 3
3 C 0 270 1000
4 D 3 100 900
5 E 0 1 2
CodePudding user response:
A tidyverse solution is
df1 %>%
select(
df1 %>%
filter(row_number() > 1) %>%
summarise(across(starts_with("c"), max)) %>%
pivot_longer(everything()) %>%
filter(value < 5) %>%
pull(name)
)
c1
1 500
2 1
3 0
4 3
5 0
Explanation: the code inside the select
calculates the minimum value for each column after ignoring the first row. The restult is then pivot
ed into long format, creating default columns name
and value
. This data frame is filtered to select only those columns where every value is less than five. The name
column is then pull
ed and used as an argument to the outer select
.
If you need other columns, just modify the select
, for example,
df1 %>%
select(
c("Gene",
df1 %>%
filter(row_number() > 1) %>%
summarise(across(starts_with("c"), max)) %>%
pivot_longer(everything()) %>%
filter(value < 5) %>%
pull(name)
)
)
CodePudding user response:
A base R solution is simple to code.
i <- sapply(df1[-1], \(x) all(x[-1] < 5))
df1[c(TRUE, i)]
#> Gene c1
#> 1 A 500
#> 2 B 1
#> 3 C 0
#> 4 D 3
#> 5 E 0
Created on 2022-06-03 by the reprex package (v2.0.1)
CodePudding user response:
To avoid reshaping or looping, use the colSums
.
df1[c(1, which(colSums(df1[-1, -1] < 5) == 4) 1)]
# Gene c1
# 1 A 500
# 2 B 1
# 3 C 0
# 4 D 3
# 5 E 0