R: Replace string when partial match to another column by row-CodePudding

I would like to replace/remove those parts of a string (name) that match to other columns (stateand city) in my data table.

I managed to identify the rows, e.g. with city, like so: dt%>% filter(str_detect(name, city)) but I am missing a way to use gsub (or grep) with the rowwise value of the column city.

I know that a rather manual approach like storing all city names in a vector and enter them in gsub would work but it would also falsely remove the "dallas" of row 2. (This was manageable for states though and could be combined with gsub to also remove "of".)

Data and desired output

dt<- data.table(city = c("arecibo","arecibo","cabo rojo", "new york", "dallas"), 
state=c("pr", "pr", "pr", "ny", "tx"), 
name=c("frutas of pr arecibo", "dallas frutas of pr", "cabo rojo metal plant", "greens new york", "cowboy shoes dallas tx"), 
desired=c("frutas", "dallas frutas", "metal plant", "greens", "cowboy shoes"))

CodePudding user response：

Here's a solution, but it can probably be achieved faster with gsub methods. Anyway :

library(tidyverse)


  dt %>% 
  mutate(test = str_remove_all(name,city)) %>% 
  mutate(test = str_remove_all(test,paste(" of ",state,sep=""))) %>% 
  mutate(test = str_remove_all(test,state)) %>% 
  mutate(test = str_remove_all(test,"^ ")) %>% 
  mutate(test = str_remove_all(test," *$"))

Output:

        city state                   name       desired          test
1:   arecibo    pr   frutas of pr arecibo        frutas        frutas
2:   arecibo    pr    dallas frutas of pr dallas frutas dallas frutas
3: cabo rojo    pr  cabo rojo metal plant   metal plant   metal plant
4:  new york    ny        greens new york        greens        greens
5:    dallas    tx cowboy shoes dallas tx  cowboy shoes  cowboy shoes

CodePudding user response：

With dplyr, we can use rowwise. First collapse all words to remove into a single character element with the OR metacharacter (as in 'arecibo|pr|of'), then call str_remove_all with that pattern. Finally, remove remaining whitespaces.

library(dplyr)
library(stringr)

dt %>%
    rowwise()%>%
    mutate(desired_2 = str_remove_all(name, paste(c(city, state, 'of'), collapse = '|'))%>%
               trimws())

# A tibble: 5 × 5
# Rowwise: 
  city      state name                   desired       desired_2    
  <chr>     <chr> <chr>                  <chr>         <chr>        
1 arecibo   pr    frutas of pr arecibo   frutas        frutas       
2 arecibo   pr    dallas frutas of pr    dallas frutas dallas frutas
3 cabo rojo pr    cabo rojo metal plant  metal plant   metal plant  
4 new york  ny    greens new york        greens        greens       
5 dallas    tx    cowboy shoes dallas tx cowboy shoes  cowboy shoes

CodePudding user response：

A data.table solution:

# Helper function
subxy <-  function(string, rmv) mapply(function(x, y) sub(x, '', y), rmv, string)

dt[,  desired2 := name |> subxy(city) |> subxy(state) |> subxy('of') |> trimws()]

#         city state                   name       desired      desired2
# 1:   arecibo    pr   frutas of pr arecibo        frutas        frutas
# 2:   arecibo    pr    dallas frutas of pr dallas frutas dallas frutas
# 3: cabo rojo    pr  cabo rojo metal plant   metal plant   metal plant
# 4:  new york    ny        greens new york        greens        greens
# 5:    dallas    tx cowboy shoes dallas tx  cowboy shoes  cowboy shoes