Home > Net >  In data frame, find groups containing only certain value (equal to) in given column [R]
In data frame, find groups containing only certain value (equal to) in given column [R]

Time:04-08

I would like to obtain a vector containing names (ID) of groups, in which certain column (class) contains only one type of value (A). In other words: I want to check whether the groups contain only given value (A) in given column (class) - if they do, I would like to collect the name of this groups.

Below are 2 dummy dataframes:

ID <- c(rep("first", 4), rep("second", 5), rep("third", 3))
segment <- c(1,2,6,7,1,2,3,4,6,1,4,5)
class <- c(rep("A", 6), "G", rep("A", 5))

df1 <- data.frame(ID, segment, class)

ID <- c(rep("first", 4), rep("second", 5), rep("third", 3))
segment <- c(1,2,6,7,1,2,3,4,6,1,4,5)
class <- c(rep("A", 6), "G", rep("A", 2), rep("B", 3))

df2 <- data.frame(ID, segment, class)

And here is my dplyr solution:

output_grouped_by_ID <- df2 %>% dplyr::group_by(ID) %>% 
dplyr::mutate(identical = n_distinct(class)==1 & class=="A") %>%
dplyr::filter(identical==FALSE)
uniq <- unique(output_grouped_by_ID$ID)

Is there a faster way using base R or data.table?

CodePudding user response:

The following one-liner in base R should be relatively fast:

names(which(sapply(split(df1$class, df1$ID), function(x) all(x == 'A'))))
#> [1] "first" "third"

This is more than an order of magnitude faster than the current method:

method_sapply <- function() {
  names(which(sapply(split(df1$class, df1$ID), function(x) all(x == 'A'))))
}

method_dplyr <- function() {
  output_grouped_by_ID <- df1 %>% dplyr::group_by(ID) %>% 
    dplyr::mutate(identical = n_distinct(class)==1 & class=="A") %>%
    dplyr::filter(identical==FALSE)
    unique(output_grouped_by_ID$ID)
}

microbenchmark::microbenchmark(method_sapply(), method_dplyr())
#>Unit: microseconds
#>            expr    min      lq     mean  median      uq    max neval cld
#> method_sapply()   53.6   70.70  110.513   78.75   93.40 2836.0   100  a 
#>  method_dplyr() 3330.8 3575.85 3934.029 4020.45 4175.95 7232.1   100   b

CodePudding user response:

Using tapply.

names(which(with(df2, tapply(class, ID, unique) != 'A')))
# [1] "second" "third" 
  • Related