Specify "ends with" in str_replace for fixing address abbreviations

Time:10-21

I have a df of addresses for schools for multiple years, but due to inconsistent data input, some schools have addresses written differently between years.

I am trying to fix them by using str_replace to change things like "St." into "Street", but sometimes I need to specify that the pattern is at the end of the string. For example, "West St" should become "West Street", but if I replace "St" with "Street", then the ones already written correctly become "West Streetreet".

library(dplyr)    # %>% and mutate()
library(stringr)  # str_replace()
library(tibble)   # tribble()

test <- tribble(
  ~ name,   ~ year,   ~address,
  "school 1",  2000,   "1 Main Ave",
  "school 1",  2001,   "1 Main Avenue",
  "school 1",  2002,   "1 Main Ave",
  "school 1",  2004,   "1 Main Avenue",
  "school 2",  2000,   "200 West St",
  "school 2",  2001,   "200 West Street",
  "school 2",  2002,   "200 West St",
  "school 2",  2004,   "200 West St",
  "school 3",  2000,   "2759 Lakeshore Road",
  "school 3",  2001,   "2759 Lakeshore Road",
  "school 3",  2002,   "2759 Lakeshore Rd",
  "school 3",  2004,   "2759 Lakeshore Rd"

)

test %>% 
  mutate(address2 = str_replace(address, "Rd","Road"),
         address2 = str_replace(address2, "Ave","Avenue"),
         address2 = str_replace(address2, "St","Street"))
  

This returns:

# A tibble: 12 × 4
   name      year address             address2           
   <chr>    <dbl> <chr>               <chr>              
 1 school 1  2000 1 Main Ave          1 Main Avenue      
 2 school 1  2001 1 Main Avenue       1 Main Avenuenue   
 3 school 1  2002 1 Main Ave          1 Main Avenue      
 4 school 1  2004 1 Main Avenue       1 Main Avenuenue   
 5 school 2  2000 200 West St         200 West Street    
 6 school 2  2001 200 West Street     200 West Streetreet
 7 school 2  2002 200 West St         200 West Street    
 8 school 2  2004 200 West St         200 West Street    
 9 school 3  2000 2759 Lakeshore Road 2759 Lakeshore Road
10 school 3  2001 2759 Lakeshore Road 2759 Lakeshore Road
11 school 3  2002 2759 Lakeshore Rd   2759 Lakeshore Road
12 school 3  2004 2759 Lakeshore Rd   2759 Lakeshore Road

which is obviously not correct. How do I specify that "St" should be replaced only when the string ENDS with it?

CodePudding user response:

In a regular expression, "$" anchors the match to the end of the string:

test %>% 
  mutate(address2 = str_replace(address, "Rd$","Road"),
         address2 = str_replace(address2, "Ave$","Avenue"),
         address2 = str_replace(address2, "St$","Street"))
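One thing to watch with the `$` anchor: the abbreviation has to be the very last characters of the string, so stray trailing whitespace prevents a match. The sketch below uses made-up addresses (not the question's data) and shows that trimming first with `str_trim()` avoids this.

```r
library(stringr)

# Hypothetical sample addresses; the second has a trailing space
addr <- c("200 West St", "200 West St ", "200 West Street")

# Without trimming, the second element does not end in "St" and is left unchanged
untrimmed <- str_replace(addr, "St$", "Street")

# str_trim() removes leading/trailing whitespace before the anchored replace
trimmed <- str_replace(str_trim(addr), "St$", "Street")

print(untrimmed)
print(trimmed)
```

Whether trimming is needed depends on how clean the real address column is; it is harmless if there is no stray whitespace.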

CodePudding user response:

You can use \\b, which matches a word boundary. With it, "St" can appear anywhere in the string, but it is replaced only when it is a complete word by itself.

library(dplyr)
library(stringr)

test %>% 
  mutate(address2 = str_replace(address, "\\bRd\\b","Road"),
         address2 = str_replace(address2, "\\bAve\\b","Avenue"),
         address2 = str_replace(address2, "\\bSt\\b","Street"))
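To make the difference between the two approaches concrete, here is a small comparison on made-up addresses (assumed for illustration, not taken from the question). The end anchor only touches a trailing "St", while the word-boundary pattern also rewrites a standalone "St" in the middle of the string, which may or may not be what you want.

```r
library(stringr)

# Hypothetical sample addresses
addr <- c("200 West St", "200 West Street", "5 St Marks Rd")

# "$" version: only an "St" at the very end is replaced
end_only <- str_replace(addr, "St$", "Street")

# "\\b" version: any standalone word "St" is replaced, wherever it occurs
whole_word <- str_replace(addr, "\\bSt\\b", "Street")

print(end_only)
print(whole_word)
```

In both cases "200 West Street" is left alone, because the "St" inside "Street" is neither at the end of the string nor a complete word.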

Alternatively, if you create a named vector whose names are the patterns and whose values are the replacements, this becomes a one-liner with str_replace_all:

pat <- setNames(c("Road", "Avenue", "Street"), 
                c("\\bRd\\b", "\\bAve\\b", "\\bSt\\b"))

test %>% mutate(address2 = str_replace_all(address, pat))

#   name      year address             address2           
#   <chr>    <dbl> <chr>               <chr>              
# 1 school 1  2000 1 Main Ave          1 Main Avenue      
# 2 school 1  2001 1 Main Avenue       1 Main Avenue      
# 3 school 1  2002 1 Main Ave          1 Main Avenue      
# 4 school 1  2004 1 Main Avenue       1 Main Avenue      
# 5 school 2  2000 200 West St         200 West Street    
# 6 school 2  2001 200 West Street     200 West Street    
# 7 school 2  2002 200 West St         200 West Street    
# 8 school 2  2004 200 West St         200 West Street    
# 9 school 3  2000 2759 Lakeshore Road 2759 Lakeshore Road
#10 school 3  2001 2759 Lakeshore Road 2759 Lakeshore Road
#11 school 3  2002 2759 Lakeshore Rd   2759 Lakeshore Road
#12 school 3  2004 2759 Lakeshore Rd   2759 Lakeshore Road
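The question also mentions abbreviations written with a trailing period, such as "St.". One possible way to cover both spellings, offered as a sketch and not part of the answer above, is to follow each abbreviation with an optional literal period (`\\.?`). The name `pat_dot` and the sample addresses are made up for illustration.

```r
library(stringr)

# Hypothetical patterns: abbreviation as a whole word, optionally followed by "."
pat_dot <- c("\\bRd\\b\\.?"  = "Road",
             "\\bAve\\b\\.?" = "Avenue",
             "\\bSt\\b\\.?"  = "Street")

# Made-up sample addresses
addr <- c("200 West St.", "200 West St", "200 West Street", "1 Main Ave.")

normalized <- str_replace_all(addr, pat_dot)
print(normalized)
```

Including the period in the match means it is consumed by the replacement, so "St." becomes "Street" and not "Street.".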

CodePudding user response:

Because your data has a particular structure (each school appears once per year), you can also do the following: group by school, then apply the longest address in each group to all rows of that group, since the full address must be longer than its abbreviation:

library(dplyr)
library(stringr)  # for str_length()
test %>% 
  group_by(name) %>% 
  mutate(address = address[which.max(str_length(address))]) %>% 
  ungroup()

   name      year address            
   <chr>    <dbl> <chr>              
 1 school 1  2000 1 Main Avenue      
 2 school 1  2001 1 Main Avenue      
 3 school 1  2002 1 Main Avenue      
 4 school 1  2004 1 Main Avenue      
 5 school 2  2000 200 West Street    
 6 school 2  2001 200 West Street    
 7 school 2  2002 200 West Street    
 8 school 2  2004 200 West Street    
 9 school 3  2000 2759 Lakeshore Road
10 school 3  2001 2759 Lakeshore Road
11 school 3  2002 2759 Lakeshore Road
12 school 3  2004 2759 Lakeshore Road
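
Whichever approach is used, it can be worth confirming that every school ends up with exactly one address. Below is a minimal check using the longest-string approach; `test_small` is a shortened, hypothetical stand-in for the question's data so that the snippet runs on its own.

```r
library(dplyr)
library(stringr)
library(tibble)

# Shortened stand-in for the question's data (illustration only)
test_small <- tribble(
  ~name,      ~year, ~address,
  "school 1", 2000,  "1 Main Ave",
  "school 1", 2001,  "1 Main Avenue",
  "school 2", 2000,  "200 West St",
  "school 2", 2001,  "200 West Street"
)

# Apply the longest address per school, then count distinct addresses per school
check <- test_small %>%
  group_by(name) %>%
  mutate(address = address[which.max(str_length(address))]) %>%
  summarise(n_addresses = n_distinct(address))

print(check)
```

Note that this approach assumes a school never actually changed address between years; if one did, the longest address would silently overwrite the others.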