I have two vectors of terms in R, as follows:
A <- data.frame(c("Absolute Value", "absolute deviation", "acceptance line ; acceptance boundary", "age-adjusted rate", "variance", "modified mean ; modified arithmetic mean ; trimmed mean ", "standard error (stdev)"))
B <- data.frame(c("descriptive", "Acceptance Boundary", "deviation", "stdev", "modified arithmetic mean", "mutability"))
I want to compare the two vectors and create a vector C with the terms of vector B that are not in vector A. I want the code to ignore capitalisation, i.e. to recognise that Acceptance Boundary and acceptance boundary are the same. If a term appears in more than one form (separated by ;), e.g. (a) acceptance line ; acceptance boundary, or (b) "standard error (stdev)" and "stdev", I want each form to be recognised as the same term.
I want the final result to be:
C <- data.frame(c("descriptive", "deviation", "mutability"))
In a similar question, Chris provided a solution; however, I couldn't adapt that code to make it work in this question's case.
CodePudding user response:
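If you want to adapt the previous solution to the new data (the code assumes that the column in A and B is named x):
library(tidyverse) # loads dplyr, tidyr and stringr, which the code below uses
a <- A %>%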
# separate expressions into distinct rows:
separate_rows(x, sep = " ; | \\(") %>% # New
# remove trailing ): # New
mutate(x = str_remove(x, "\\)")) %>% # New
# define the result as a vector:
pull(x) %>%
# connect the elements in the vector with alternation marker '|':
str_c(., collapse = "|")
Then match the expressions in a to the (separate) expressions in B:
B %>%
# separate the expressions into their own row each:
#separate_rows(x, sep = " ; ") %>%
# match `x` to `a` & extract matches into new column:
mutate(matches = str_extract_all(x, str_c("(?i)", a))) %>%
# unnest the listed items in the new column:
unnest(where(is.list), keep_empty = TRUE) %>%
# filter (i.e., retain only) the non-matches:
filter(is.na(matches)) %>%
# deselect the no-longer necessary column:
select(-matches)
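One thing to keep in mind with the pattern approach above: str_extract_all() looks for the pattern anywhere inside x, so a term in B that merely contains a term from A also counts as a match. Below is a small sketch of the difference between substring matching and whole-term matching. The shortened pattern and the example terms are made up for illustration and are not taken from the question's data.

```r
library(stringr)

# Hypothetical, shortened version of the alternation pattern `a`
a <- "absolute value|variance|stdev"

# Hypothetical terms to test
terms <- c("Variance", "variance component", "mutability")

# Substring match, as in the pipeline above: "variance component" also matches
partial <- str_detect(terms, str_c("(?i)", a))

# Whole-term match: anchor the alternation with ^ and $
whole <- str_detect(terms, regex(str_c("^(?:", a, ")$"), ignore_case = TRUE))

print(partial)
print(whole)
```

If whole-term matching is what you need, the anchored pattern could be used in place of str_c("(?i)", a) in the mutate() step.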
CodePudding user response:
If A and B are vectors (not data frames as in your example), then you can use strsplit() and other helper functions like tolower() and trimws() to separate the values of A into separate words/concepts. Then use setdiff() to find the differences between B and your cleaned set of words/concepts:
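# split A on " ; " or " (", lower-case and trim each piece, then drop the closing ")"
Avals = gsub("\\)", "", trimws(tolower(unlist(strsplit(A,"( ; )|( \\()")))))
# terms of B (lower-cased and trimmed the same way) that are not in Avals
setdiff(trimws(tolower(B)),Avals)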
Output:
"descriptive" "deviation" "mutability"
Input:
A = c("Absolute Value", "absolute deviation", "acceptance line ; acceptance boundary",
"age-adjusted rate", "variance", "modified mean ; modified arithmetic mean ; trimmed mean ",
"standard error (stdev)")
B = c("descriptive", "Acceptance Boundary", "deviation", "stdev",
"modified arithmetic mean", "mutability")
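If A and B are data frames, as in the question, the same idea can be applied to their first column. Here is a minimal base R sketch; the column name x and the helper normalise() are my own choices for illustration and do not appear in the question.

```r
# Data from the question, with the column explicitly named x (an assumption)
A <- data.frame(x = c("Absolute Value", "absolute deviation",
                      "acceptance line ; acceptance boundary",
                      "age-adjusted rate", "variance",
                      "modified mean ; modified arithmetic mean ; trimmed mean ",
                      "standard error (stdev)"))
B <- data.frame(x = c("descriptive", "Acceptance Boundary", "deviation",
                      "stdev", "modified arithmetic mean", "mutability"))

# Hypothetical helper: lower-case and strip surrounding whitespace
normalise <- function(v) trimws(tolower(v))

# Split each entry of A on " ; " or " (", then drop the closing ")"
pieces <- unlist(strsplit(as.character(A[[1]]), " ; | \\("))
a_terms <- normalise(gsub("\\)", "", pieces))

# Keep the rows of B whose normalised term is not among the terms of A
keep <- !(normalise(as.character(B[[1]])) %in% a_terms)
C <- B[keep, , drop = FALSE]

print(C)
# C$x is "descriptive" "deviation" "mutability"
```

Unlike setdiff() on the lower-cased values, this keeps the original spelling of the terms in B and returns a data frame rather than a vector.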