Home > other >  Return Matched Value to Dataframe from Uneven List
Return Matched Value to Dataframe from Uneven List

Time:02-24

I've looked at several SO posts but I'm at the "beating head against tree" stage. I appreciate your time.

I have a dataframe (about 300 cases) with a text string; I simply want to scan a separate list of cities (7000 of them), and if city in the string matches the list, I want a new data frame column to be written with the matching city name.

My data:

df<-structure(list(Item = c("1965 Wilkes College, Wilkes-Barre, PA", "1967 Spanish National Tourist Office, New York City", "1968 William Penn Memorial Museum, Harrisburg, PA", 
"2010 Strange Evidence. Philadelphia Museum of Art. Peter Barbarie, Curator.","1973 The Museum of Modern Art, New York City", "1974 International Museum of Photography, George Eastman House, Rochester, NY",  "1974 Light Gallery, New York City", "1975 Art Institute of Chicago"
    )), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"
    ))

citylist<-c("Barre","Sacramento","Palmer", "New York City","Chicago","Rochester")

I've tried things like:

 df$city<-sapply(df$Item,function(x) df$Item[df$Item %in% citylist])

Or

citymatcher<-function(x){match(x,citylist)} 
df$city<-sapply(df$Item,citymatcher)

Ultimately, I want a tidy dataframe with a new column indicating the city that matches a given row.

CodePudding user response:

We could use str_match:

library(stringr)
library(dplyr)

pattern <- paste(citylist, collapse = '|')

mutate(df, city = str_match(Item, pattern))
 Item                                                                          city_1[,1]   
  <chr>                                                                         <chr>        
1 1965 Wilkes College, Wilkes-Barre, PA                                         Barre        
2 1967 Spanish National Tourist Office, New York City                           New York City
3 1968 William Penn Memorial Museum, Harrisburg, PA                             NA           
4 1973 The Museum of Modern Art, New York City                                  New York City
5 1974 International Museum of Photography, George Eastman House, Rochester, NY Rochester    
6 1974 Light Gallery, New York City                                             New York City
7 1975 Art Institute of Chicago                                                 Chicago      
> 

CodePudding user response:

We could use regex_left_join (from fuzzyjoin) after creating a data.frame/tibble from the 'citylist' vector

library(fuzzyjoin)
regex_left_join(df, tibble(city = citylist), by = c("Item" = "city"))
# A tibble: 7 × 2
  Item                                                                          city         
  <chr>                                                                         <chr>        
1 1965 Wilkes College, Wilkes-Barre, PA                                         Barre        
2 1967 Spanish National Tourist Office, New York City                           New York City
3 1968 William Penn Memorial Museum, Harrisburg, PA                             <NA>         
4 1973 The Museum of Modern Art, New York City                                  New York City
5 1974 International Museum of Photography, George Eastman House, Rochester, NY Rochester    
6 1974 Light Gallery, New York City                                             New York City
7 1975 Art Institute of Chicago                                                 Chicago      

Just to avoid any partial substring match, use word boundary (\\b)

library(stringr)
library(dplyr) 
regex_left_join(df, tibble(city = sprintf("\\b%s\\b", 
       citylist)), by = c("Item" = "city")) %>%
    mutate(city = str_remove_all(city, fixed("\\b")))

-ouptut

# A tibble: 8 × 2
  Item                                                                          city         
  <chr>                                                                         <chr>        
1 1965 Wilkes College, Wilkes-Barre, PA                                         Barre        
2 1967 Spanish National Tourist Office, New York City                           New York City
3 1968 William Penn Memorial Museum, Harrisburg, PA                             <NA>         
4 2010 Strange Evidence. Philadelphia Museum of Art. Peter Barbarie, Curator.   <NA>         
5 1973 The Museum of Modern Art, New York City                                  New York City
6 1974 International Museum of Photography, George Eastman House, Rochester, NY Rochester    
7 1974 Light Gallery, New York City                                             New York City
8 1975 Art Institute of Chicago                                                 Chicago      

CodePudding user response:

Is this what you're looking for?

library(tidyverse)

# collapse citylist into a regex search string
city_regex <- str_c(citylist, collapse = "|")

# extract matching values using stringr::str_extract
df <- df %>% 
  mutate(city = str_extract(Item, city_regex))

Output:

# A tibble: 7 x 2
  Item                                                                          city         
  <chr>                                                                         <chr>        
1 1965 Wilkes College, Wilkes-Barre, PA                                         Barre        
2 1967 Spanish National Tourist Office, New York City                           New York City
3 1968 William Penn Memorial Museum, Harrisburg, PA                             NA           
4 1973 The Museum of Modern Art, New York City                                  New York City
5 1974 International Museum of Photography, George Eastman House, Rochester, NY Rochester    
6 1974 Light Gallery, New York City                                             New York City
7 1975 Art Institute of Chicago                                                 Chicago    

CodePudding user response:

Possible base R solution:

citylist2 <- paste0(citylist, collapse = "|")

df$city <- ifelse(grepl(citylist2, df$Item), sub(paste0(".*(", citylist2, ").*"), "\\1", df$Item), NA)

Or we could use regexpr and regmatches:

citylist2 <- paste0(citylist, collapse = "|")

df$city <- NA
df[grepl(citylist2, df$Item),]$city <- regmatches(df$Item, regexpr(citylist2, df$Item))

Output

  Item                                                                          city         
  <chr>                                                                         <chr>        
1 1965 Wilkes College, Wilkes-Barre, PA                                         Barre        
2 1967 Spanish National Tourist Office, New York City                           New York City
3 1968 William Penn Memorial Museum, Harrisburg, PA                             NA           
4 1973 The Museum of Modern Art, New York City                                  New York City
5 1974 International Museum of Photography, George Eastman House, Rochester, NY Rochester    
6 1974 Light Gallery, New York City                                             New York City
7 1975 Art Institute of Chicago                                                 Chicago     

Benchmark

enter image description here

  • Related