I would like to replace/remove those parts of a string (name
) that match to other columns (state
and city
) in my data table.
I managed to identify the rows, e.g. with city, like so:
dt%>% filter(str_detect(name, city))
but I am missing a way to use gsub
(or grep
) with the rowwise value of the column city.
I know that a rather manual approach like storing all city names in a vector and enter them in gsub
would work but it would also falsely remove the "dallas" of row 2. (This was manageable for states though and could be combined with gsub to also remove "of".)
Data and desired output
dt<- data.table(city = c("arecibo","arecibo","cabo rojo", "new york", "dallas"),
state=c("pr", "pr", "pr", "ny", "tx"),
name=c("frutas of pr arecibo", "dallas frutas of pr", "cabo rojo metal plant", "greens new york", "cowboy shoes dallas tx"),
desired=c("frutas", "dallas frutas", "metal plant", "greens", "cowboy shoes"))
CodePudding user response:
Here's a solution, but it can probably be achieved faster with gsub
methods. Anyway :
library(tidyverse)
dt %>%
mutate(test = str_remove_all(name,city)) %>%
mutate(test = str_remove_all(test,paste(" of ",state,sep=""))) %>%
mutate(test = str_remove_all(test,state)) %>%
mutate(test = str_remove_all(test,"^ ")) %>%
mutate(test = str_remove_all(test," *$"))
Output:
city state name desired test
1: arecibo pr frutas of pr arecibo frutas frutas
2: arecibo pr dallas frutas of pr dallas frutas dallas frutas
3: cabo rojo pr cabo rojo metal plant metal plant metal plant
4: new york ny greens new york greens greens
5: dallas tx cowboy shoes dallas tx cowboy shoes cowboy shoes
CodePudding user response:
With dplyr, we can use rowwise
. First collapse all words to remove into a single character element with the OR metacharacter (as in 'arecibo|pr|of'
), then call str_remove_all
with that pattern.
Finally, remove remaining whitespaces.
library(dplyr)
library(stringr)
dt %>%
rowwise()%>%
mutate(desired_2 = str_remove_all(name, paste(c(city, state, 'of'), collapse = '|'))%>%
trimws())
# A tibble: 5 × 5
# Rowwise:
city state name desired desired_2
<chr> <chr> <chr> <chr> <chr>
1 arecibo pr frutas of pr arecibo frutas frutas
2 arecibo pr dallas frutas of pr dallas frutas dallas frutas
3 cabo rojo pr cabo rojo metal plant metal plant metal plant
4 new york ny greens new york greens greens
5 dallas tx cowboy shoes dallas tx cowboy shoes cowboy shoes
CodePudding user response:
A data.table
solution:
# Helper function
subxy <- function(string, rmv) mapply(function(x, y) sub(x, '', y), rmv, string)
dt[, desired2 := name |> subxy(city) |> subxy(state) |> subxy('of') |> trimws()]
# city state name desired desired2
# 1: arecibo pr frutas of pr arecibo frutas frutas
# 2: arecibo pr dallas frutas of pr dallas frutas dallas frutas
# 3: cabo rojo pr cabo rojo metal plant metal plant metal plant
# 4: new york ny greens new york greens greens
# 5: dallas tx cowboy shoes dallas tx cowboy shoes cowboy shoes