Can't seem to get stringr() just right for mid-string extraction

Time:05-03

I want to extract titles (Mr, Mrs, Miss) from within the Name column and import those extracted titles into a new column Title. Relevant data looks like this:

snippet <- data_frame(Name=c('Braund, Mr. Owen Harris','Cumings, Mrs. John Bradley','Heikkinen, Miss. Laina'),Column=c('blah','blah','blah'))

I've reviewed this answer, but I must be missing something.

Here's the best code I could come up with: snippet <- mutate(snippet, Title = str_extract(snippet$Name, "(?<=,)[^,]*(?=.)")). This does add the Title column, but all values in that column are NA. Where's my error? Thanks.

CodePudding user response:

Maybe this helps. In the 'Name' column there is a space after the comma, so we use regex lookarounds to match one or more non-whitespace characters (\\S+) that follow the comma and space ((?<=, )) and precede the period ((?=\\.)). The . is a regex metacharacter, so we escape it; otherwise it matches any character.

library(dplyr)
library(stringr)
snippet <- snippet %>% 
  mutate(Title = str_extract(Name, "(?<=, )\\S+(?=\\.)"))

-output

snippet
# A tibble: 3 × 3
  Name                       Column Title
  <chr>                      <chr>  <chr>
1 Braund, Mr. Owen Harris    blah   Mr   
2 Cumings, Mrs. John Bradley blah   Mrs  
3 Heikkinen, Miss. Laina     blah   Miss 
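
To check the pattern outside of mutate(), here is a minimal sketch on a plain character vector. The names names_vec, titles and titles_alt are made up for illustration, and the str_match() variant with a capture group is just an alternative way to write the same idea, not part of the answer above.

```r
library(stringr)

# Hypothetical vector holding the same three names as the example data
names_vec <- c("Braund, Mr. Owen Harris",
               "Cumings, Mrs. John Bradley",
               "Heikkinen, Miss. Laina")

# An unescaped "." matches any character; "\\." matches only a literal period
str_detect("abc", ".")    # TRUE
str_detect("abc", "\\.")  # FALSE

# Lookbehind for ", " and lookahead for a literal "." around \\S+
titles <- str_extract(names_vec, "(?<=, )\\S+(?=\\.)")

# Same idea without lookarounds: capture the title and take column 2
titles_alt <- str_match(names_vec, ", (\\S+)\\.")[, 2]

print(titles)
```

Both versions rely on the title being the first run of non-whitespace characters after the comma and space, followed by a period. A name that does not follow that layout would give NA.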

data

snippet <- structure(list(Name = c("Braund, Mr. Owen Harris", 
"Cumings, Mrs. John Bradley", 
"Heikkinen, Miss. Laina"), Column = c("blah", "blah", "blah")), 
class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -3L))