I have two vectors in R, as follows:
A <- data.frame(c("Absolute Value", "absolute deviation", "acceptance line ; acceptance boundary", "age-adjusted rate", "variance", "modified mean ; modified arithmetic mean ; trimmed mean "))
B <- data.frame(c("descriptive", "Acceptance Boundary", "deviation", "modified arithmetic mean", "mutability ; variability"))
I want to compare the two vectors and create a vector C with the terms of vector B that are not in vector A. The code should ignore capitalisation, i.e. recognise that Acceptance Boundary and acceptance boundary are the same term. If an entry lists a term in more than one form (separated by ;), e.g. acceptance line ; acceptance boundary, each form should be recognised as the same term.
I want the final result to be:
C <- data.frame(c("descriptive", "deviation", "mutability ; variability"))
Can someone help me?
CodePudding user response:
Here's a tidyverse solution. First, create a vector a containing all (separate) expressions in A as an alternation pattern:
library(tidyverse)

a <- A %>%
  # separate expressions into distinct rows:
  separate_rows(x, sep = " ; ") %>%
  # extract the column as a vector:
  pull(x) %>%
  # drop stray leading/trailing spaces (e.g. "trimmed mean "):
  str_trim() %>%
  # connect the elements in the vector with the alternation marker '|':
  str_c(., collapse = "|")
Then match the expressions in a to the (separate) expressions in B:
B %>%
  # separate the expressions into their own row each:
  separate_rows(x, sep = " ; ") %>%
  # match `x` to `a` (case-insensitively) and extract matches into a new column:
  mutate(matches = str_extract_all(x, str_c("(?i)", a))) %>%
  # unnest the listed items in the new column:
  unnest(where(is.list), keep_empty = TRUE) %>%
  # filter (i.e., retain only) the non-matches:
  filter(is.na(matches)) %>%
  # deselect the no-longer necessary column:
  select(-matches)
# A tibble: 4 × 1
  x
  <chr>
1 descriptive
2 deviation
3 mutability
4 variability
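One thing to keep in mind: an alternation pattern matches substrings, so a term in B that merely contains a term from A would also count as a match. The base R sketch below illustrates this with a shortened pattern and a made-up term ("variance ratio" is hypothetical and not in the data). Anchoring the pattern with ^( ... )$ requires the whole string to match.

```r
# Shortened, illustrative pattern (an assumption, not the full `a` from the answer)
pattern <- "variance|acceptance boundary"

# Unanchored: a hit anywhere inside the string counts as a match
unanchored_hit <- grepl(pattern, "variance ratio", ignore.case = TRUE)

# Anchored: the whole string must equal one of the alternatives
anchored <- paste0("^(", pattern, ")$")
anchored_hit <- grepl(anchored, "variance ratio", ignore.case = TRUE)
exact_hit <- grepl(anchored, "Acceptance Boundary", ignore.case = TRUE)

print(c(unanchored_hit, anchored_hit, exact_hit))
```

In the stringr code, the equivalent would presumably be to build the pattern as str_c("(?i)^(", a, ")$"). With the example data, this should not change the result.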
Data (note the use of x as column name):
A <- data.frame(x = c("Absolute Value", "absolute deviation", "acceptance line ; acceptance boundary", "age-adjusted rate", "variance", "modified mean ; modified arithmetic mean ; trimmed mean "))
B <- data.frame(x = c("descriptive", "Acceptance Boundary", "deviation", "modified arithmetic mean", "mutability ; variability"))
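If the rows of B should stay intact (so that "mutability ; variability" is returned as one entry, as in the desired C), a base R approach with exact matching is another option. This is only a minimal sketch: normalise() is a hypothetical helper name, and it assumes a row of B should be dropped as soon as any one of its ;-separated forms occurs in A.

```r
# Data as in the question, with `x` as the column name
A <- data.frame(x = c("Absolute Value", "absolute deviation",
                      "acceptance line ; acceptance boundary",
                      "age-adjusted rate", "variance",
                      "modified mean ; modified arithmetic mean ; trimmed mean "),
                stringsAsFactors = FALSE)
B <- data.frame(x = c("descriptive", "Acceptance Boundary", "deviation",
                      "modified arithmetic mean", "mutability ; variability"),
                stringsAsFactors = FALSE)

# Hypothetical helper: split one entry on ';', trim whitespace, lower-case
normalise <- function(s) {
  tolower(trimws(strsplit(s, ";", fixed = TRUE)[[1]]))
}

# All normalised terms that occur anywhere in A
a_terms <- unlist(lapply(A$x, normalise))

# Keep a row of B only if none of its forms occurs in A
keep <- vapply(B$x, function(s) !any(normalise(s) %in% a_terms), logical(1))

C <- data.frame(x = B$x[keep], stringsAsFactors = FALSE)
print(C$x)
```

Because this compares whole terms with %in% rather than using a regular expression, "deviation" is not confused with "absolute deviation", and no escaping of special characters is needed.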