I have this data in R:
string_1 = c("newyork 123", "california 123", "washington 123")
string_2 = c("123 red", "123 blue", "123 green")
my_data = data.frame(string_1, string_2)
I want to "subtract" string_2 from string_1. The result would look something like this:
"newyork", "california", "washington"
I tried to do this:
library(tidyverse)
# did not work as planned
> str_remove(string_1, "string_2")
[1] "newyork 123" "california 123" "washington 123"
But this is not performing a "full" subtraction.
- Does anyone know how to do this?
- Should I try to do this with an ANTI JOIN in SQL?
Thank you!
CodePudding user response:
You could split both strings and find the set difference of them.
mapply(setdiff, strsplit(string_1, "\\s+"), strsplit(string_2, "\\s+"))
# [1] "newyork" "california" "washington"
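One caveat with the split-and-setdiff idea: when a row keeps more than one word (or none), mapply() returns a list rather than a character vector. A minimal base R sketch that always returns one string per row is below. The helper name subtract_tokens is made up for illustration, and the second vector is a changed test input, not the data from the question. Note that setdiff() also drops duplicated words within a row.

```r
string_1 <- c("newyork 123", "california 123", "washington 123")
string_2 <- c("123 red", "456 blue", "789 green")  # changed test input (assumption)

# Hypothetical helper: for each row, drop the whitespace-separated words of x
# that also occur in the matching element of y, then glue the rest back together.
subtract_tokens <- function(x, y) {
  leftover <- Map(setdiff, strsplit(x, "\\s+"), strsplit(y, "\\s+"))
  vapply(leftover, paste, character(1), collapse = " ")
}

result <- subtract_tokens(string_1, string_2)
print(result)
```

Because every row is collapsed back to a single string, the result has the same length as the input and can be assigned straight into a data frame column.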
CodePudding user response:
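library(purrr)
library(stringr)  # str_split() comes from stringr
# Split each string into words
list1 <- str_split(string_1, pattern = " ")
list2 <- str_split(string_2, pattern = " ")
# Keep the words of string_1 that do not appear in the matching string_2
a <- map2(list1, list2, function(x, y) {
  setdiff(x, y)
}) %>% unlist()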
CodePudding user response:
library(dplyr)
library(purrr)
library(stringr)
string_1 <- c("newyork 123", "california 123", "washington 123")
string_2 <- c("123 red", "123 blue", "123 green")
my_data <- data.frame(string_1, string_2)
my_data %>%
  mutate(
    subtracted = map2(
      str_split(string_1, "\\s+"),
      str_split(string_2, "\\s+"),
      ~ setdiff(.x, .y)
    ) %>% map_chr(~ paste0(.x, collapse = " "))
  )
#>         string_1  string_2 subtracted
#> 1    newyork 123   123 red    newyork
#> 2 california 123  123 blue california
#> 3 washington 123 123 green washington
If I change string_2 as @DarrenTsai suggested, so that only the first row shares "123" with string_1, only the shared words are removed, which is what we want:
string_1 <- c("newyork 123", "california 123", "washington 123")
string_2_test <- c("123 red", "456 blue", "789 green")
my_data <- data.frame(string_1, string_2_test)
my_data %>%
  mutate(
    subtracted = map2(
      str_split(string_1, "\\s+"),
      str_split(string_2_test, "\\s+"),
      ~ setdiff(.x, .y)
    ) %>% map_chr(~ paste0(.x, collapse = " "))
  )
#>         string_1 string_2_test     subtracted
#> 1    newyork 123       123 red        newyork
#> 2 california 123      456 blue california 123
#> 3 washington 123     789 green washington 123
Created on 2022-07-07 by the reprex package (v2.0.1)
CodePudding user response:
Another option with tidyverse, where you split string_2 for each row, then collapse the pieces into a single pattern that matches any of the words (i.e., using | as "or", so e.g. "123" or "red"), then remove those matches using str_remove_all. Then we can pull the string_1 column with the deletions.
library(tidyverse)
my_data %>%
  rowwise() %>%
  mutate(string_1 = trimws(str_remove_all(string_1, str_c(
    unlist(str_split(string_2, " ")), collapse = "|")))) %>%
  pull(string_1)
Output
[1] "newyork" "california" "washington"
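One thing to watch with the "|" pattern: it also removes partial matches, so "red" would be stripped out of a word like "redwood". Below is a rough sketch of the same idea with word boundaries added, written in base R so it runs without extra packages. The helper name remove_whole_words and the "redwood" input are made up for illustration, and the sketch assumes the words contain no regex metacharacters.

```r
string_1 <- c("redwood 123 red", "california 123", "washington 123")  # illustrative input
string_2 <- c("123 red", "123 blue", "123 green")

# Hypothetical helper: remove from each element of x the whole words listed
# in the matching element of y (assumes plain alphanumeric words).
remove_whole_words <- function(x, y) {
  # One pattern per row, e.g. "\\b(123|red)\\b"
  patterns <- vapply(
    strsplit(y, "\\s+"),
    function(tokens) paste0("\\b(", paste(tokens, collapse = "|"), ")\\b"),
    character(1)
  )
  # gsub() is not vectorised over the pattern, so pair rows up with mapply()
  cleaned <- mapply(function(p, s) gsub(p, "", s, perl = TRUE),
                    patterns, x, USE.NAMES = FALSE)
  # Squeeze the whitespace left behind by the deletions
  trimws(gsub("\\s+", " ", cleaned))
}

cleaned_strings <- remove_whole_words(string_1, string_2)
print(cleaned_strings)
```

With the boundaries in place, "redwood" is kept intact while the standalone "red" and "123" are removed.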