Compare two vectors within a data frame with %in% with R

Time:07-16


I have the following data

T1 <- data.frame( "Col1" = c("a", "b", "aa", "d"), "Col2" = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g") )

Col1 Col2
a a,b,c
b aa,c,d
aa c,d,e
d d,f,g

I want to select the rows that contain a value from the vector c("a", "e", "g"), specifying the column:

library(dplyr)

T1 %>% filter(Col1 %in% c("a", "e", "g"))

It returns:

1 a a,b,c

That is correct, but what if I want to compare two vectors? For example:

With unlist and strsplit, I transform the value of each row to a character vector and try to compare it with the reference vector to select the rows that contain any of the values:

unlist(strsplit(T1$Col2[1],","))

[1] "a" "b" "c"

T1 %>% filter(unlist(strsplit(Col2,",")) %in% c("a", "e", "g"))

It gives me an error:

Error in filter():
! Problem while computing ..1 = unlist(strsplit(Col2, ",")) %in% c("a", "e", "g").
✖ Input ..1 must be of size 4 or 1, not size 12.
Run rlang::last_error() to see where the error occurred.

I can do it like this:

T1[grep(c("a|e|g"), T1$Col2),]

1 a a,b,c
2 b aa,c,d
3 aa c,d,e
4 d d,f,g

But that is wrong: row 2 (b aa,c,d) should not be selected, because its Col2 contains "aa", not "a".

To search for the "a" alone, you would have to do:

T1[grep(c("\\<a\\>"), T1$Col2),]

I think I will end up making a mistake with this approach. I would feel safer comparing vector with vector:

T1 %>% filter(unlist(strsplit(Col2,",")) %in% c("a", "e", "g"))

CodePudding user response:

Edited answer

You can use \\b, the regular-expression word boundary, so that "a" no longer matches inside "aa". The | works as an "or" between the alternatives. You can use the following code:

T1 <- data.frame( "Col1" = c("a", "b", "aa", "d"), "Col2" = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g") )
library(dplyr)
library(stringr)
T1 %>% 
  filter(grepl("\\b(a|e|g)\\b", Col2))
#>   Col1  Col2
#> 1    a a,b,c
#> 2   aa c,d,e
#> 3    d d,f,g

Created on 2022-07-16 by the reprex package (v2.0.1)

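Note: in an ordinary R string literal the backslash has to be escaped, so write "\\b" ("\b" on its own is a backspace character). From R 4.0.0 on you can also use a raw string, e.g. r"(\b(a|e|g)\b)".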
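If the reference values live in a vector, the word-boundary pattern can be built from it instead of being typed by hand. This is only a sketch in base R: the names targets, pattern and keep are made up for illustration, and it assumes the values contain no regex metacharacters (such as . or +).

```r
T1 <- data.frame(
  Col1 = c("a", "b", "aa", "d"),
  Col2 = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g")
)

# Hypothetical name for the reference vector from the question
targets <- c("a", "e", "g")

# Build "\\b(a|e|g)\\b" from the vector; safe only for plain alphanumeric values
pattern <- paste0("\\b(", paste(targets, collapse = "|"), ")\\b")

# One logical per row: TRUE FALSE TRUE TRUE
keep <- grepl(pattern, T1$Col2)

res_regex <- T1[keep, ]
```

Row 2 (aa,c,d) is dropped because "a" has to be surrounded by word boundaries.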

old answer

It returns all rows because the pattern a|e|g matches substrings: "aa" in row 2 contains "a", row 3 contains "e" and row 4 contains "g". You could also use str_detect like this:

library(dplyr)
library(stringr)
T1 <- data.frame( "Col1" = c("a", "b", "aa", "d"), "Col2" = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g") )
vector <- c("a", "e", "g")
T1 %>%  
  filter(str_detect(Col2, paste0(vector, collapse="|")))
#>   Col1   Col2
#> 1    a  a,b,c
#> 2    b aa,c,d
#> 3   aa  c,d,e
#> 4    d  d,f,g

Created on 2022-07-16 by the reprex package (v2.0.1)

If you want to check whether any of the strings occurs in either of the two columns, you can use the following code:

library(dplyr)
library(stringr)
T1 <- data.frame( "Col1" = c("a", "b", "aa", "d"), "Col2" = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g") )
vector <- c("a", "e", "g")
T1 %>% 
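  # str_detect() takes the string first and the pattern second
  filter(Reduce(`|`, across(all_of(colnames(T1)), ~str_detect(.x, paste0(vector, collapse="|")))))
#>   Col1   Col2
#> 1    a  a,b,c
#> 2    b aa,c,d
#> 3   aa  c,d,e
#> 4    d  d,f,g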

Created on 2022-07-16 by the reprex package (v2.0.1)
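If whole values rather than substrings should be matched in several columns, one option is to split each cell on the comma and test the pieces with %in%. The following is a base R sketch, not a drop-in replacement for the code above: has_target, targets and keep_any are hypothetical names, and it assumes the values are separated by a plain "," without spaces.

```r
T1 <- data.frame(
  Col1 = c("a", "b", "aa", "d"),
  Col2 = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g")
)
targets <- c("a", "e", "g")

# Hypothetical helper: TRUE where any comma-separated piece of a cell is in `targets`
has_target <- function(x, targets) {
  pieces <- strsplit(as.character(x), ",", fixed = TRUE)
  vapply(pieces, function(p) any(p %in% targets), logical(1))
}

# Combine the per-column results with "or": TRUE FALSE TRUE TRUE
keep_any <- Reduce(`|`, lapply(T1[c("Col1", "Col2")], has_target, targets = targets))

res_any <- T1[keep_any, ]
```

Here row 2 is excluded, since neither "b" nor any of "aa", "c", "d" is in the reference vector.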

CodePudding user response:

Another way to achieve this, keeping your original strsplit approach, is to work rowwise() and sum the logical test:

T1 %>% 
  rowwise() %>% 
  filter(sum(unlist(strsplit(Col2,",")) %in% c("a","e","g")) >= 1)
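For comparison, the same test can be written without rowwise(). The original error arises because unlist() flattens the four rows into one vector of 12 pieces, while filter() needs one logical per row. The sketch below keeps the pieces grouped per row. It uses base R only, and the names targets, tokens and keep are assumptions for illustration.

```r
T1 <- data.frame(
  Col1 = c("a", "b", "aa", "d"),
  Col2 = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g")
)
targets <- c("a", "e", "g")

# strsplit() returns a list with one character vector per row
tokens <- strsplit(T1$Col2, ",", fixed = TRUE)

# Flattening loses the row structure: 4 rows become 12 values
n_flat <- length(unlist(tokens))

# One logical per row: does any piece of this row appear in `targets`?
keep <- vapply(tokens, function(p) any(p %in% targets), logical(1))

res_tokens <- T1[keep, ]
```

The logical vector keep has length 4, so it could also be passed to filter() in place of the rowwise() step.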