Subtracting Two Strings in R

Time:07-07

I have this data in R:

string_1 = c("newyork 123", "california 123", "washington 123")
string_2 = c("123 red", "123 blue", "123 green")
my_data = data.frame(string_1, string_2)

I want to "subtract" string_2 from string_1. The result would look something like this:

"newyork", "california", "washington"

I tried to do this:

library(tidyverse)

# did not work as planned
> str_remove(string_1, "string_2")

[1] "newyork 123"    "california 123" "washington 123"

But this is not performing a "full" subtraction.

  • Does anyone know how to do this?
  • Should I try to do this with an ANTI JOIN in SQL?

Thank you!

CodePudding user response:

You could split both strings and find the set difference of them.

mapply(setdiff, strsplit(string_1, "\\s+"), strsplit(string_2, "\\s+"))

# [1] "newyork"    "california" "washington"
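The split-and-setdiff idea above can be wrapped in a small helper so that rows with more than one leftover word still come back as a single string. This is only a sketch in base R: `subtract_words` is a hypothetical name, not something from the answer, and it assumes words are separated by whitespace.

```r
# Hypothetical helper (not part of the original answer): for each pair of
# strings, drop every whitespace-separated word of x that also occurs in y.
subtract_words <- function(x, y) {
  x_words <- strsplit(x, "\\s+")
  y_words <- strsplit(y, "\\s+")
  # Compare the word lists pairwise, then glue the leftover words back together
  mapply(function(a, b) paste(setdiff(a, b), collapse = " "),
         x_words, y_words, USE.NAMES = FALSE)
}

string_1 <- c("newyork 123", "california 123", "washington 123")
string_2 <- c("123 red", "123 blue", "123 green")

result <- subtract_words(string_1, string_2)
print(result)
# [1] "newyork"    "california" "washington"
```

Note that setdiff() also drops repeated words within an element of string_1; if duplicates must be preserved, `a[!a %in% b]` is one alternative.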

CodePudding user response:

library(purrr)
library(stringr)  # provides str_split()

# Split each string into words
list1 <- str_split(string_1, pattern = " ")
list2 <- str_split(string_2, pattern = " ")

# For each pair, keep the words of string_1 that do not appear in string_2
a <- unlist(map2(list1, list2, setdiff))

CodePudding user response:

library(dplyr)
library(purrr)
library(stringr)

string_1 <- c("newyork 123", "california 123", "washington 123")
string_2 <- c("123 red", "123 blue", "123 green")

my_data <- data.frame(string_1, string_2)

my_data %>%
    mutate(
        subtracted = map2(
            str_split(string_1, "\\s+"),
            str_split(string_2, "\\s+"),
            ~ setdiff(.x, .y)
        ) %>% map_chr(~ paste0(.x, collapse = " "))
    )

#>         string_1  string_2 subtracted
#> 1    newyork 123   123 red    newyork
#> 2 california 123  123 blue california
#> 3 washington 123 123 green washington

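If string_2 is changed as @DarrenTsai suggested, so that only the first row shares "123" with string_1, the result is still what we want: only the shared word is removed, and the other rows are left unchanged.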

string_1 <- c("newyork 123", "california 123", "washington 123")
string_2_test <- c("123 red", "456 blue", "789 green")

my_data <- data.frame(string_1, string_2_test)

my_data %>%
    mutate(
        subtracted = map2(
            str_split(string_1, "\\s+"),
            str_split(string_2_test, "\\s+"),
            ~ setdiff(.x, .y)
        ) %>% map_chr(~ paste0(.x, collapse = " "))
    )

#>         string_1 string_2_test     subtracted
#> 1    newyork 123       123 red        newyork
#> 2 california 123      456 blue california 123
#> 3 washington 123     789 green washington 123

Created on 2022-07-07 by the reprex package (v2.0.1)

CodePudding user response:

Another option with tidyverse: split string_2 for each row, then collapse the pieces into a single regular expression joined with | (meaning "or", e.g. "123|red"), and remove every match from string_1 using str_remove_all. trimws then strips the leftover whitespace, and pull extracts the modified string_1 column.

library(tidyverse)

my_data %>%
  rowwise() %>%
  mutate(string_1 = trimws(str_remove_all(string_1, str_c(
    unlist(str_split(string_2, " ")), collapse = "|")))) %>%
  pull(string_1)

Output

[1] "newyork"    "california" "washington"
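One caveat with a plain "123|red" pattern is that it also matches inside longer words (for example the "123" in "1234"). A possible variant, sketched here in base R, wraps the alternation in \\b word boundaries so that only whole words are removed. `remove_whole_words` is a hypothetical name, the sample data are made up for illustration, and the sketch assumes the words in string_2 contain no regex metacharacters.

```r
# Hypothetical helper (not part of the original answer): remove from each
# element of x the whole words listed in the matching element of y.
# Assumes the words in y contain no regex metacharacters.
remove_whole_words <- function(x, y) {
  words <- strsplit(y, "\\s+")
  out <- mapply(function(s, w) {
    # Build e.g. "\\b(123|red)\\b" so that only complete words match
    pattern <- paste0("\\b(", paste(w, collapse = "|"), ")\\b")
    gsub(pattern, "", s, perl = TRUE)
  }, x, words, USE.NAMES = FALSE)
  # Squeeze repeated spaces left behind by the removals, then trim the ends
  trimws(gsub("\\s+", " ", out))
}

# Illustrative data, chosen to include partial matches
string_1 <- c("newyork 123", "california 1234", "redwood red")
string_2 <- c("123 red", "123 blue", "red green")

result <- remove_whole_words(string_1, string_2)
print(result)
# [1] "newyork"         "california 1234" "redwood"
```

Here "1234" and "redwood" survive because "123" and "red" only match as complete words.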