Hi I have a column with names. It has names in the format of: c("Tom", "Tom Turner", "Dr. Tom Turner", "R. Tom Turner", "J Tom Turner", "Jr. Tom Turner"). I just want to extract the first name but I am not exactly how to do it in an easy way due to the prefixes on the names. Please let me know if you have any suggestions.

CodePudding user response：

This is an approach:

library(magrittr) # for %>% 

dirty_names  <- c(
    "Tom",
    "Tom Turner",
    "Dr. Tom Turner",
    "R. Tom Turner",
    "J Tom Turner",
    "Jr. Tom Turner"
)

dirty_names   %>% 
    # Remove first word if it ends with . e.g. Dr., Jr., R.
    sub("^\\w \\.", "", .)   %>% 
    trimws()  %>% 
    # Remove first word if it is one letter e.g. J
    sub("^[A-Za-z] ", "", .)    %>% 
    # Delete everything after first word
    sub("(\\w ).*", "\\1", .)
# [1] "Tom" "Tom" "Tom" "Tom" "Tom" "Tom"

CodePudding user response：

Solution

Here is a solution in the tidyverse, which uses regular expressions ("regex") to extract every component of interest:

Optional prefix: either a single letter (J), or several letters followed by a period (Jr.); separated from the ensuing name by whitespace ( ).
Required first_name: a "streak" of characters before the next whitespace.
Optional last_name: a "streak" of characters after that next whitespace.

# Load useful functions.
library(tidyverse)


# ...
# Code to generate a 'dirty_data' table with a 'dirty_name' column.
# ...


# Define the regex for extracting the name components, each within a (capture group).
dirty_regex <-
  #                      Prefix                                     Next Whitespace
  # |----------------------------------------------|                |------------|
  "^((([[:alpha:]])|([[:alpha:]] \\.))[[:blank:]] )?([^[:blank:]] )(([[:blank:]]*)(.*))?$"
  #                                                 |-------------|               |--|
  #                                                   First Name               Last Name


# Clean the 'dirty_data' and store it in a fresh table: 'clean_data'.
clean_data <- dirty_data %>%
  mutate(
    # Remove external whitespace for easier analysis.
    clean_full_name = str_trim(dirty_name),
    
    # Break the dirty names (using regex) into a matrix of their components.
    name_components = str_match(dirty_name, dirty_regex),
    
    # Extract each component.
    clean_prefix = name_components[, 2],
    clean_first_name = name_components[, 6],
    clean_last_name = name_components[, 9],
    
    # Remove the matrix.
    name_components = NULL,
    
    # Trim any external whitespace in the (new) components.
    across(starts_with("clean_") & !clean_full_name, str_trim),
    
    # Replace any empty strings ("") with blanks (NAs).
    across(starts_with("clean_"), na_if, y = "")
  )


# Print and inspect our result.
clean_data

Result

Given data like your dirty_data below

# The dirty names.
dirty_names_vec <- c("Tom", "Tom Turner", "Dr. Tom Turner", "R. Tom Turner", "J Tom Turner", "Jr. Tom Turner")

# A table with a column for the dirty names.
dirty_data <- tibble(dirty_name = dirty_names_vec)

this workflow should yield the following result for clean_data:

# A tibble: 6 × 5
  dirty_name     clean_full_name clean_prefix clean_first_name clean_last_name
  <chr>          <chr>           <chr>        <chr>            <chr>          
1 Tom            Tom             NA           Tom              NA             
2 Tom Turner     Tom Turner      NA           Tom              Turner         
3 Dr. Tom Turner Dr. Tom Turner  Dr.          Tom              Turner         
4 R. Tom Turner  R. Tom Turner   R.           Tom              Turner         
5 J Tom Turner   J Tom Turner    J            Tom              Turner         
6 Jr. Tom Turner Jr. Tom Turner  Jr.          Tom              Turner

Note

If other "dirty" names are in different formats, you must modify your dirty_regex accordingly. You should likewise adjust the index i of each capture group, used to extract the components via clean_* = name_components[, i].

See str_match() from the stringr package, for extracting components in "capture groups". For further information on defining those groups, see regular expressions with stringr.