Subsetting text based on differences between two vectors in R

Time:12-19

I have two vectors with terms as follows in R:

A <- data.frame(c("Absolute Value", "absolute deviation", "acceptance line ; acceptance boundary", "age-adjusted rate", "variance", "modified mean ; modified arithmetic mean ; trimmed mean ", "standard error (stdev)"))
B  <- data.frame(c("descriptive", "Acceptance Boundary", "deviation", "stdev", "modified arithmetic mean", "mutability"))

I want to compare the two vectors and create a vector C containing the terms of vector B that are not in vector A. The comparison should ignore case, i.e. treat "Acceptance Boundary" and "acceptance boundary" as the same term. It should also recognise a term when A lists it as one of several variants, whether (a) separated by a semicolon, as in "acceptance line ; acceptance boundary", or (b) given in parentheses, as in "standard error (stdev)", which should match "stdev".

I want the final result to be:

C <- data.frame(c("descriptive", "deviation", "mutability")) 

In a similar question, Chris provided a solution, but I couldn't adapt that code to work for this case.

CodePudding user response:

If you want to adapt the previous solution to the new data (this assumes the column in both A and B is named x, e.g. data.frame(x = c(...))):

library(dplyr)
library(tidyr)
library(stringr)

a <- A %>%
  # separate expressions into distinct rows:
  separate_rows(x, sep = " ; | \\(") %>%             # New
  # remove trailing ) and stray whitespace:          # New
  mutate(x = str_trim(str_remove(x, "\\)"))) %>%     # New
  # define the result as a vector:
  pull(x) %>%
  # connect the elements in the vector with alternation marker '|':
  str_c(., collapse = "|")
    

Then match the expressions in a to the (separate) expressions in B:

B %>%
  # separate the expressions into their own row each:
  #separate_rows(x, sep = " ; ") %>%
  # match `x` to `a` & extract matches into new column:
  mutate(matches = str_extract_all(x, str_c("(?i)", a))) %>% 
  # unnest the listed items in the new column:
  unnest(where(is.list), keep_empty = TRUE) %>%
  # filter (i.e., retain only) the non-matches:
  filter(is.na(matches)) %>%
  # deselect the no-longer necessary column:
  select(-matches)
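
One thing to be aware of with the alternation approach: a pattern such as "variance|stdev" matches anywhere inside a string, so a longer term in B that merely contains a term from A would also count as a match. Here is a minimal base R sketch of the difference. The terms and the "variance ratio" example are made up for illustration and are not from the question's data:

```r
# Hypothetical terms, joined with '|' as in the answer above
terms   <- c("variance", "stdev")
pattern <- paste(terms, collapse = "|")

# Unanchored: matches because "variance" occurs inside the string
substring_hit <- grepl(pattern, "variance ratio", ignore.case = TRUE)

# Anchored with ^(...)$: only whole-string matches count
anchored   <- paste0("^(", pattern, ")$")
exact_hit  <- grepl(anchored, "variance ratio", ignore.case = TRUE)
exact_same <- grepl(anchored, "Variance", ignore.case = TRUE)

print(c(substring_hit, exact_hit, exact_same))
# [1]  TRUE FALSE  TRUE
```

For the data in the question both versions give the same result. Anchoring only matters if B can contain longer phrases that embed a term from A. Note that terms containing regex metacharacters would need escaping before being pasted into a pattern.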

CodePudding user response:

If A and B are vectors (not data frames as in your example), you can use strsplit() together with helper functions such as tolower() and trimws() to split the values of A into separate terms. Then use setdiff() to find the terms of B that are not in the cleaned set of terms from A:

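# split A on " ; " or " (", lower-case, trim, and drop the closing ")":
Avals <- gsub("\\)", "", trimws(tolower(unlist(strsplit(A, "( ; )|( \\()")))))
# terms of B (lower-cased and trimmed) that are not among the terms of A:
setdiff(trimws(tolower(B)), Avals)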

Output:

"descriptive" "deviation"   "mutability" 

Input:

A = c("Absolute Value", "absolute deviation", "acceptance line ; acceptance boundary", 
"age-adjusted rate", "variance", "modified mean ; modified arithmetic mean ; trimmed mean ", 
"standard error (stdev)")

B = c("descriptive", "Acceptance Boundary", "deviation", "stdev", 
"modified arithmetic mean", "mutability")
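
If the terms are stored in data frames, as in the question, the same idea can be wrapped in two small functions. This is only a sketch in base R. The helper names normalise_terms() and terms_not_in() are hypothetical, and the column name x is an assumption (the question's data frames do not name their column):

```r
# Hypothetical helper: split multi-variant entries into clean, lower-case terms
normalise_terms <- function(x) {
  parts <- unlist(strsplit(as.character(x), ";|\\("))  # split on ';' or '('
  parts <- gsub(")", "", parts, fixed = TRUE)           # drop closing ')'
  parts <- trimws(tolower(parts))                       # ignore case and padding
  unique(parts[nzchar(parts)])                          # drop empty strings
}

# Hypothetical helper: terms of b (original spelling) not found among a's terms
terms_not_in <- function(b, a) {
  b <- as.character(b)
  b[!(trimws(tolower(b)) %in% normalise_terms(a))]
}

# Data from the question, with the column named x (an assumption)
A <- data.frame(x = c("Absolute Value", "absolute deviation",
                      "acceptance line ; acceptance boundary",
                      "age-adjusted rate", "variance",
                      "modified mean ; modified arithmetic mean ; trimmed mean ",
                      "standard error (stdev)"),
                stringsAsFactors = FALSE)
B <- data.frame(x = c("descriptive", "Acceptance Boundary", "deviation",
                      "stdev", "modified arithmetic mean", "mutability"),
                stringsAsFactors = FALSE)

C <- data.frame(x = terms_not_in(B$x, A$x), stringsAsFactors = FALSE)
print(C$x)
# [1] "descriptive" "deviation"   "mutability"
```

Unlike setdiff() on lower-cased input, this keeps the original spelling of the terms in B and compares whole terms only, so "deviation" is not treated as a match for "absolute deviation".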