Identifying duplicates in a list of character vectors in R

Time:02-12

I have a list of character vectors like this:

my_list <- list(c('a','b','c','d','e'),c('e','f','g'),c('h','i','j'))
names(my_list) <- c("group1","group2","group3")

I want a simple way to test my_list for letters that are duplicated across any of the 3 groups/vectors in the list. For instance, "e" appears in both group1 and group2, so that would be a duplicate. Ideally, something simple that returns a single logical: TRUE if at least one letter appears in 2 or more groups, and FALSE if the letters in each group are unique to that group (which obviously isn't the case in my example).

Thanks so much!

CodePudding user response:

A single logical result can be obtained with:

any(duplicated(unlist(my_list)))
[1] TRUE

As @sindri_baldur correctly pointed out in the comments, values repeated within a single group would also be flagged. If only duplicates across groups should count, apply unique to each group first:

any(duplicated(unlist(lapply(my_list, unique))))
[1] TRUE

Or, as another base R alternative (anyDuplicated returns the index of the first duplicate, or 0 if there is none):

anyDuplicated(unlist(lapply(my_list, unique))) > 0
[1] TRUE
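
To see why the per-group unique step matters, here is a minimal sketch using a hypothetical list, repeats_within (not from the question), in which a value repeats inside one group but never across groups:

```r
# Hypothetical example: "a" repeats within group1,
# but no value is shared between the two groups
repeats_within <- list(group1 = c("a", "a", "b"), group2 = c("c", "d"))

# Without unique(), the within-group repeat is reported as a duplicate
naive <- any(duplicated(unlist(repeats_within)))

# With unique() applied per group, only values shared between groups count
across <- anyDuplicated(unlist(lapply(repeats_within, unique))) > 0

naive   # TRUE
across  # FALSE
```

Which of the two behaviours is wanted depends on whether repeats inside a single group should count as duplicates.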

CodePudding user response:

You could do:

subset(stack(my_list), duplicated(values))$values
[1] "e"

If you need to tell which groups share at least one value with another group, you could do:

result <- setNames(logical(length(my_list)), names(my_list))

result[unique(unlist(Filter(\(x)length(x)>1,
                            unstack(rev(stack(my_list))))))] <- TRUE
result
group1 group2 group3 
  TRUE   TRUE  FALSE 

or even:

library(dplyr)
stack(my_list) %>%
  mutate(dups = duplicated(values) | duplicated(values, fromLast = TRUE)) %>%
  group_by(ind) %>%
  summarise(logic = any(dups))

# A tibble: 3 x 2
  ind    logic
  <fct>  <lgl>
1 group1 TRUE 
2 group2 TRUE 
3 group3 FALSE
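
The same per-group flag can also be sketched in base R by comparing each group against the pooled values of all the other groups. This is an alternative to the approaches above rather than a restatement of them, and shares_value is a name introduced here for illustration:

```r
my_list <- list(group1 = c("a", "b", "c", "d", "e"),
                group2 = c("e", "f", "g"),
                group3 = c("h", "i", "j"))

# For each group i, check whether any of its values occurs in the other groups
shares_value <- vapply(seq_along(my_list), function(i) {
  any(my_list[[i]] %in% unlist(my_list[-i], use.names = FALSE))
}, logical(1))
names(shares_value) <- names(my_list)

shares_value
# group1 group2 group3
#   TRUE   TRUE  FALSE
```

Because each group is only compared with the other groups, values repeated inside a single group do not affect the result.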

CodePudding user response:

We can stack the named list into a two-column data.frame, get the frequency counts with table, convert them to a logical matrix (count > 0), and use colSums to count how many groups contain each value. The names of the values that occur in more than one group are then returned:

names(which(colSums(table(stack(my_list)[2:1])> 0) > 1))
[1] "e"

Or slightly more compact (note that this version also counts values repeated within a single group):

names(which(table(unlist(my_list)) > 1))
[1] "e"
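
To make the table/colSums steps concrete, the sketch below keeps the intermediate objects. It also uses a hypothetical list, with_repeat (not from the question), to show where the two one-liners differ: the compact table(unlist(...)) version counts occurrences, while the colSums version counts groups.

```r
my_list <- list(group1 = c("a", "b", "c", "d", "e"),
                group2 = c("e", "f", "g"),
                group3 = c("h", "i", "j"))

# Rows are groups (ind), columns are values; each cell is a count
counts <- table(stack(my_list)[2:1])

# Number of distinct groups that contain each value
groups_per_value <- colSums(counts > 0)
shared <- names(which(groups_per_value > 1))
shared  # "e"

# Hypothetical list: "a" repeats within group1 only
with_repeat <- list(group1 = c("a", "a"), group2 = "b")
by_count  <- names(which(table(unlist(with_repeat)) > 1))
by_groups <- names(which(colSums(table(stack(with_repeat)[2:1]) > 0) > 1))
by_count   # "a"
by_groups  # character(0)
```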

If we want a logical flag for each group:

library(dplyr)
library(tidyr)
library(tibble)
enframe(my_list) %>%
  unnest(value) %>%
  group_by(value) %>%
  mutate(flag = any(n_distinct(name) > 1)) %>%
  group_by(name) %>%
  summarise(flag = any(flag))

Output:

# A tibble: 3 × 2
  name   flag 
  <chr>  <lgl>
1 group1 TRUE 
2 group2 TRUE 
3 group3 FALSE

CodePudding user response:

Another possible solution, based on tidyr::expand_grid and purrr::pmap_lgl:

library(tidyverse)

my_list <- list(c('a','b','c','d','e'),c('e','f','g'),c('h','i','j'))
names(my_list) <- c("group1","group2","group3")

expandg <- expand_grid(names(my_list), names(my_list))

pmap_lgl(expandg, ~ any(my_list[[.x]] %in% my_list[[.y]])) %>% 
  bind_cols(id1 = expandg[[1]], id2 = expandg[[2]], value = .) %>% 
  group_by(Group = id1) %>% summarise(value = any(value[id1 != id2]))

#> # A tibble: 3 × 2
#>   Group  value
#>   <chr>  <lgl>
#> 1 group1 TRUE 
#> 2 group2 TRUE 
#> 3 group3 FALSE