I've looked at several SO posts but I'm at the "beating head against tree" stage. I appreciate your time.
I have a dataframe (about 300 cases) with a text string; I simply want to scan a separate list of cities (7000 of them), and if city in the string matches the list, I want a new data frame column to be written with the matching city name.
My data:
df<-structure(list(Item = c("1965 Wilkes College, Wilkes-Barre, PA", "1967 Spanish National Tourist Office, New York City", "1968 William Penn Memorial Museum, Harrisburg, PA",
"2010 Strange Evidence. Philadelphia Museum of Art. Peter Barbarie, Curator.","1973 The Museum of Modern Art, New York City", "1974 International Museum of Photography, George Eastman House, Rochester, NY", "1974 Light Gallery, New York City", "1975 Art Institute of Chicago"
)), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"
))
citylist<-c("Barre","Sacramento","Palmer", "New York City","Chicago","Rochester")
I've tried things like:
df$city<-sapply(df$Item,function(x) df$Item[df$Item %in% citylist])
Or
citymatcher<-function(x){match(x,citylist)}
df$city<-sapply(df$Item,citymatcher)
Ultimately, I want a tidy dataframe with a new column indicating the city that matches a given row.
CodePudding user response:
We could use str_match
:
library(stringr)
library(dplyr)
pattern <- paste(citylist, collapse = '|')
mutate(df, city = str_match(Item, pattern))
Item city_1[,1]
<chr> <chr>
1 1965 Wilkes College, Wilkes-Barre, PA Barre
2 1967 Spanish National Tourist Office, New York City New York City
3 1968 William Penn Memorial Museum, Harrisburg, PA NA
4 1973 The Museum of Modern Art, New York City New York City
5 1974 International Museum of Photography, George Eastman House, Rochester, NY Rochester
6 1974 Light Gallery, New York City New York City
7 1975 Art Institute of Chicago Chicago
>
CodePudding user response:
We could use regex_left_join
(from fuzzyjoin
) after creating a data.frame/tibble
from the 'citylist' vector
library(fuzzyjoin)
regex_left_join(df, tibble(city = citylist), by = c("Item" = "city"))
# A tibble: 7 × 2
Item city
<chr> <chr>
1 1965 Wilkes College, Wilkes-Barre, PA Barre
2 1967 Spanish National Tourist Office, New York City New York City
3 1968 William Penn Memorial Museum, Harrisburg, PA <NA>
4 1973 The Museum of Modern Art, New York City New York City
5 1974 International Museum of Photography, George Eastman House, Rochester, NY Rochester
6 1974 Light Gallery, New York City New York City
7 1975 Art Institute of Chicago Chicago
Just to avoid any partial substring match, use word boundary (\\b
)
library(stringr)
library(dplyr)
regex_left_join(df, tibble(city = sprintf("\\b%s\\b",
citylist)), by = c("Item" = "city")) %>%
mutate(city = str_remove_all(city, fixed("\\b")))
-ouptut
# A tibble: 8 × 2
Item city
<chr> <chr>
1 1965 Wilkes College, Wilkes-Barre, PA Barre
2 1967 Spanish National Tourist Office, New York City New York City
3 1968 William Penn Memorial Museum, Harrisburg, PA <NA>
4 2010 Strange Evidence. Philadelphia Museum of Art. Peter Barbarie, Curator. <NA>
5 1973 The Museum of Modern Art, New York City New York City
6 1974 International Museum of Photography, George Eastman House, Rochester, NY Rochester
7 1974 Light Gallery, New York City New York City
8 1975 Art Institute of Chicago Chicago
CodePudding user response:
Is this what you're looking for?
library(tidyverse)
# collapse citylist into a regex search string
city_regex <- str_c(citylist, collapse = "|")
# extract matching values using stringr::str_extract
df <- df %>%
mutate(city = str_extract(Item, city_regex))
Output:
# A tibble: 7 x 2
Item city
<chr> <chr>
1 1965 Wilkes College, Wilkes-Barre, PA Barre
2 1967 Spanish National Tourist Office, New York City New York City
3 1968 William Penn Memorial Museum, Harrisburg, PA NA
4 1973 The Museum of Modern Art, New York City New York City
5 1974 International Museum of Photography, George Eastman House, Rochester, NY Rochester
6 1974 Light Gallery, New York City New York City
7 1975 Art Institute of Chicago Chicago
CodePudding user response:
Possible base R solution:
citylist2 <- paste0(citylist, collapse = "|")
df$city <- ifelse(grepl(citylist2, df$Item), sub(paste0(".*(", citylist2, ").*"), "\\1", df$Item), NA)
Or we could use regexpr
and regmatches
:
citylist2 <- paste0(citylist, collapse = "|")
df$city <- NA
df[grepl(citylist2, df$Item),]$city <- regmatches(df$Item, regexpr(citylist2, df$Item))
Output
Item city
<chr> <chr>
1 1965 Wilkes College, Wilkes-Barre, PA Barre
2 1967 Spanish National Tourist Office, New York City New York City
3 1968 William Penn Memorial Museum, Harrisburg, PA NA
4 1973 The Museum of Modern Art, New York City New York City
5 1974 International Museum of Photography, George Eastman House, Rochester, NY Rochester
6 1974 Light Gallery, New York City New York City
7 1975 Art Institute of Chicago Chicago
Benchmark