Subsetting text based on matches between two vectors in R

Time:12-19

I have two vectors as follows in R:

A <- data.frame(c("Absolute Value", "absolute deviation",
                  "acceptance line ; acceptance boundary", "age-adjusted rate",
                  "variance",
                  "modified mean ; modified arithmetic mean ; trimmed mean "))
B <- data.frame(c("descriptive", "Acceptance Boundary", "deviation",
                  "modified arithmetic mean", "mutability ; variability"))

I want to compare the two vectors and create a vector C with the terms of B that are not in A. The comparison should ignore capitalisation, i.e. recognise that Acceptance Boundary and acceptance boundary are the same. If an entry lists a term in more than one way (separated by ;), e.g. acceptance line ; acceptance boundary, a match on any one of the variants should count as a match.

I want the final result to be:

C <- data.frame(c("descriptive", "deviation", "mutability ; variability")) 

Can someone help me?

CodePudding user response:

Here's a tidyverse solution:

First, create a, a single string that joins all the (separate) expressions in A into one regex alternation pattern:

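library(tidyverse)  # loads dplyr, tidyr and stringr, which both pipelines use

a <- A %>%
  # separate expressions into distinct rows:
  separate_rows(x, sep = " ; ") %>%
  # define the result as a vector:
  pull(x) %>%
  # drop stray leading/trailing whitespace (e.g. in "trimmed mean "):
  str_trim() %>%
  # connect the elements in the vector with alternation marker '|':
  str_c(collapse = "|")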
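One caveat worth illustrating: an alternation pattern like a is unanchored, so it also matches when a term from A merely occurs inside a longer term in B. The sketch below uses base R's grepl() on a shortened pattern to show the difference that anchoring with ^ and $ makes. The term "variance analysis" is made up for illustration and is not part of the original data; terms containing regex metacharacters would additionally need escaping, which this sketch does not cover.

```r
# Shortened version of the alternation pattern (unanchored):
unanchored <- "(?i)variance|absolute deviation"
# Same alternatives, but forced to match the whole string:
anchored <- "(?i)^(?:variance|absolute deviation)$"

# "variance analysis" is a hypothetical term, not in the original data.
terms <- c("Variance", "variance analysis", "deviation")

# Unanchored: a substring hit is enough, so "variance analysis" counts as a match.
hits_unanchored <- grepl(unanchored, terms, perl = TRUE)
# Anchored: only whole-term matches count (still case-insensitive).
hits_anchored <- grepl(anchored, terms, perl = TRUE)

print(data.frame(terms, hits_unanchored, hits_anchored))
```

With the data in the question the two variants give the same answer; the difference only shows up when one term is a substring of another.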

Then match the expressions in a to the (separate) expressions in B:

B %>%
  # separate the expressions into their own row each:
  separate_rows(x, sep = " ; ") %>%
  # match `x` against `a` (case-insensitively, via "(?i)") and extract matches into a new column:
  mutate(matches = str_extract_all(x, str_c("(?i)", a))) %>% 
  # unnest the listed items in the new column:
  unnest(where(is.list), keep_empty = TRUE) %>%
  # filter (i.e., retain only) the non-matches:
  filter(is.na(matches)) %>%
  # deselect the no-longer necessary column:
  select(-matches)
# A tibble: 4 × 1
  x          
  <chr>      
1 descriptive
2 deviation  
3 mutability 
4 variability

Data (note the use of x as column name):

A <- data.frame(x = c("Absolute Value", "absolute deviation",
                      "acceptance line ; acceptance boundary", "age-adjusted rate",
                      "variance",
                      "modified mean ; modified arithmetic mean ; trimmed mean "))
B <- data.frame(x = c("descriptive", "Acceptance Boundary", "deviation",
                      "modified arithmetic mean", "mutability ; variability"))
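Note that the pipeline above returns one row per separated term, so mutability ; variability comes back as two rows, whereas the requested C keeps it as a single entry. If the original entries of B should stay intact, a base R variant along the following lines might be closer to the desired result. It is only a sketch: the helper split_terms() is a name introduced here, and it assumes exact, case-insensitive matching of whole terms after trimming whitespace rather than regex matching.

```r
A <- data.frame(x = c("Absolute Value", "absolute deviation",
                      "acceptance line ; acceptance boundary", "age-adjusted rate",
                      "variance",
                      "modified mean ; modified arithmetic mean ; trimmed mean "),
                stringsAsFactors = FALSE)
B <- data.frame(x = c("descriptive", "Acceptance Boundary", "deviation",
                      "modified arithmetic mean", "mutability ; variability"),
                stringsAsFactors = FALSE)

# Hypothetical helper: split each entry on ';', trim whitespace, lower-case.
# Returns a list with one character vector of variants per entry.
split_terms <- function(x) {
  lapply(strsplit(x, ";", fixed = TRUE),
         function(parts) tolower(trimws(parts)))
}

# All normalised terms that occur anywhere in A:
a_terms <- unlist(split_terms(A$x))

# Keep an entry of B only if none of its variants occurs in A:
keep <- vapply(split_terms(B$x),
               function(parts) !any(parts %in% a_terms),
               logical(1))

C <- data.frame(x = B$x[keep], stringsAsFactors = FALSE)
print(C)
```

Because the comparison uses %in% on whole normalised terms, deviation is kept even though absolute deviation appears in A, and an entry of B is dropped as soon as any one of its variants is found in A.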