Cleaning names in R

Time:08-17

Hi, I have a column of names in formats like: c("Tom", "Tom Turner", "Dr. Tom Turner", "R. Tom Turner", "J Tom Turner", "Jr. Tom Turner"). I just want to extract the first name, but I am not exactly sure how to do it in an easy way because of the prefixes on the names. Please let me know if you have any suggestions.

CodePudding user response:

This is an approach:

library(magrittr) # for %>% 

dirty_names  <- c(
    "Tom",
    "Tom Turner",
    "Dr. Tom Turner",
    "R. Tom Turner",
    "J Tom Turner",
    "Jr. Tom Turner"
)

dirty_names %>%
    # Remove first word if it ends with . e.g. Dr., Jr., R.
    sub("^\\w+\\.", "", .) %>%
    trimws() %>%
    # Remove first word if it is one letter e.g. J
    sub("^[A-Za-z] ", "", .) %>%
    # Delete everything after first word
    sub("(\\w+).*", "\\1", .)
# [1] "Tom" "Tom" "Tom" "Tom" "Tom" "Tom"
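
For reuse on other columns, the same three steps could be wrapped in a small function. This is only a sketch: the name extract_first_name is hypothetical, and it uses plain reassignment instead of the pipe so that it needs no packages.

```r
# Hypothetical helper: the same three substitutions as above, in base R only.
extract_first_name <- function(x) {
  x <- sub("^\\w+\\.", "", x)    # drop a leading word that ends in "."
  x <- trimws(x)                 # remove the whitespace left behind
  x <- sub("^[A-Za-z] ", "", x)  # drop a leading single letter
  sub("(\\w+).*", "\\1", x)      # keep only the first remaining word
}

extract_first_name(c("Dr. Tom Turner", "J Tom Turner", "Tom"))
# [1] "Tom" "Tom" "Tom"

# \\w does not match a hyphen, so a hyphenated first name is cut short.
extract_first_name("Mary-Anne Smith")
# [1] "Mary"
```

The second call shows one limitation: names outside the formats in the question may need extra rules.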

CodePudding user response:

Solution

Here is a solution in the tidyverse, which uses regular expressions ("regex") to extract every component of interest:

  • Optional prefix: either a single letter (J), or several letters followed by a period (Jr.); separated from the ensuing name by whitespace ( ).
  • Required first_name: a "streak" of characters before the next whitespace.
  • Optional last_name: a "streak" of characters after that next whitespace.
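
As a rough sketch of how the first two bullets translate into patterns, the prefix and the first name can be tried on their own before being combined into one expression. The names sample_names, prefix_pattern and first_name_pattern are assumptions made for illustration.

```r
library(stringr)

sample_names <- c("Tom", "Dr. Tom Turner", "J Tom Turner")

# Assumed pattern for the optional prefix: one letter, or letters plus a
# period, followed by whitespace.
prefix_pattern <- "^([[:alpha:]]|[[:alpha:]]+\\.)[[:blank:]]+"

# Assumed pattern for the first name: a streak of non-whitespace characters.
first_name_pattern <- "^[^[:blank:]]+"

# Strip the prefix (if any), then take the first streak of what remains.
without_prefix <- str_remove(sample_names, prefix_pattern)
first_names <- str_extract(without_prefix, first_name_pattern)
first_names
# [1] "Tom" "Tom" "Tom"
```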

# Load useful functions.
library(tidyverse)


# ...
# Code to generate a 'dirty_data' table with a 'dirty_name' column.
# ...


# Define the regex for extracting the name components, each within a (capture group).
dirty_regex <-
  #                      Prefix                                     Next Whitespace
  # |----------------------------------------------|                |------------|
  "^((([[:alpha:]])|([[:alpha:]]+\\.))[[:blank:]]+)?([^[:blank:]]+)(([[:blank:]]*)(.*))?$"
  #                                                 |-------------|               |--|
  #                                                   First Name               Last Name


# Clean the 'dirty_data' and store it in a fresh table: 'clean_data'.
clean_data <- dirty_data %>%
  mutate(
    # Remove external whitespace for easier analysis.
    clean_full_name = str_trim(dirty_name),
    
    # Break the trimmed names (using regex) into a matrix of their components.
    name_components = str_match(clean_full_name, dirty_regex),
    
    # Extract each component.
    clean_prefix = name_components[, 2],
    clean_first_name = name_components[, 6],
    clean_last_name = name_components[, 9],
    
    # Remove the matrix.
    name_components = NULL,
    
    # Trim any external whitespace in the (new) components.
    across(starts_with("clean_") & !clean_full_name, str_trim),
    
    # Replace any empty strings ("") with blanks (NAs).
    across(starts_with("clean_"), ~ na_if(.x, ""))
  )


# Print and inspect our result.
clean_data

Result

Given a dirty_data table like the one below,

# The dirty names.
dirty_names_vec <- c("Tom", "Tom Turner", "Dr. Tom Turner", "R. Tom Turner", "J Tom Turner", "Jr. Tom Turner")

# A table with a column for the dirty names.
dirty_data <- tibble(dirty_name = dirty_names_vec)

this workflow should yield the following result for clean_data:

# A tibble: 6 × 5
  dirty_name     clean_full_name clean_prefix clean_first_name clean_last_name
  <chr>          <chr>           <chr>        <chr>            <chr>          
1 Tom            Tom             NA           Tom              NA             
2 Tom Turner     Tom Turner      NA           Tom              Turner         
3 Dr. Tom Turner Dr. Tom Turner  Dr.          Tom              Turner         
4 R. Tom Turner  R. Tom Turner   R.           Tom              Turner         
5 J Tom Turner   J Tom Turner    J            Tom              Turner         
6 Jr. Tom Turner Jr. Tom Turner  Jr.          Tom              Turner         

Note

If other "dirty" names are in different formats, you must modify your dirty_regex accordingly. You should likewise adjust the index i of each capture group, used to extract the components via clean_* = name_components[, i].

See str_match() from the stringr package, for extracting components in "capture groups". For further information on defining those groups, see regular expressions with stringr.
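
To see where the indices 2, 6 and 9 come from, it can help to print the matrix that str_match() returns for a single name. The sketch below redefines dirty_regex so that it runs on its own; the sample names are made up for illustration.

```r
library(stringr)

dirty_regex <-
  "^((([[:alpha:]])|([[:alpha:]]+\\.))[[:blank:]]+)?([^[:blank:]]+)(([[:blank:]]*)(.*))?$"

# Column 1 is the whole match; columns 2 onward are the capture groups,
# numbered by the position of their opening parenthesis.
components <- str_match("Dr. Tom Turner", dirty_regex)
components[, c(2, 6, 9)]
# Column 2 is "Dr. " (with its trailing space, hence the later trimming),
# column 6 is "Tom", and column 9 is "Turner".

# A format the regex was not written for: two stacked prefixes.
stacked <- str_match("Prof. Dr. Tom Turner", dirty_regex)
stacked[, 6]
# [1] "Dr."
```

In the second call the extra title lands in the first-name column, which is the kind of case where both the pattern and the column indices would need adjusting.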
