I would like to obtain a vector containing names (ID) of groups, in which certain column (class) contains only one type of value (A). In other words: I want to check whether the groups contain only given value (A) in given column (class) - if they do, I would like to collect the name of this groups.
Below are 2 dummy dataframes:
ID <- c(rep("first", 4), rep("second", 5), rep("third", 3))
segment <- c(1,2,6,7,1,2,3,4,6,1,4,5)
class <- c(rep("A", 6), "G", rep("A", 5))
df1 <- data.frame(ID, segment, class)
ID <- c(rep("first", 4), rep("second", 5), rep("third", 3))
segment <- c(1,2,6,7,1,2,3,4,6,1,4,5)
class <- c(rep("A", 6), "G", rep("A", 2), rep("B", 3))
df2 <- data.frame(ID, segment, class)
And here is my dplyr solution:
output_grouped_by_ID <- df2 %>% dplyr::group_by(ID) %>%
dplyr::mutate(identical = n_distinct(class)==1 & class=="A") %>%
dplyr::filter(identical==FALSE)
uniq <- unique(output_grouped_by_ID$ID)
Is there a faster way using base R or data.table?
CodePudding user response:
The following one-liner in base R should be relatively fast:
names(which(sapply(split(df1$class, df1$ID), function(x) all(x == 'A'))))
#> [1] "first" "third"
This is more than an order of magnitude faster than the current method:
method_sapply <- function() {
names(which(sapply(split(df1$class, df1$ID), function(x) all(x == 'A'))))
}
method_dplyr <- function() {
output_grouped_by_ID <- df1 %>% dplyr::group_by(ID) %>%
dplyr::mutate(identical = n_distinct(class)==1 & class=="A") %>%
dplyr::filter(identical==FALSE)
unique(output_grouped_by_ID$ID)
}
microbenchmark::microbenchmark(method_sapply(), method_dplyr())
#>Unit: microseconds
#> expr min lq mean median uq max neval cld
#> method_sapply() 53.6 70.70 110.513 78.75 93.40 2836.0 100 a
#> method_dplyr() 3330.8 3575.85 3934.029 4020.45 4175.95 7232.1 100 b
CodePudding user response:
Using tapply
.
names(which(with(df2, tapply(class, ID, unique) != 'A')))
# [1] "second" "third"