Hi I have a column with names. It has names in the format of:
c("Tom", "Tom Turner", "Dr. Tom Turner", "R. Tom Turner", "J Tom Turner", "Jr. Tom Turner")
.
I just want to extract the first name but I am not exactly how to do it in an easy way due to the prefixes on the names. Please let me know if you have any suggestions.
CodePudding user response:
This is an approach:
library(magrittr) # for %>%
dirty_names <- c(
"Tom",
"Tom Turner",
"Dr. Tom Turner",
"R. Tom Turner",
"J Tom Turner",
"Jr. Tom Turner"
)
dirty_names %>%
# Remove first word if it ends with . e.g. Dr., Jr., R.
sub("^\\w \\.", "", .) %>%
trimws() %>%
# Remove first word if it is one letter e.g. J
sub("^[A-Za-z] ", "", .) %>%
# Delete everything after first word
sub("(\\w ).*", "\\1", .)
# [1] "Tom" "Tom" "Tom" "Tom" "Tom" "Tom"
CodePudding user response:
Solution
Here is a solution in the tidyverse
, which uses regular expressions ("regex") to extract every component of interest:
- Optional
prefix
: either a single letter (J
), or several letters followed by a period (Jr.
); separated from the ensuing name by whitespace ( - Required
first_name
: a "streak" of characters before the next whitespace. - Optional
last_name
: a "streak" of characters after that next whitespace.
# Load useful functions.
library(tidyverse)
# ...
# Code to generate a 'dirty_data' table with a 'dirty_name' column.
# ...
# Define the regex for extracting the name components, each within a (capture group).
dirty_regex <-
# Prefix Next Whitespace
# |----------------------------------------------| |------------|
"^((([[:alpha:]])|([[:alpha:]] \\.))[[:blank:]] )?([^[:blank:]] )(([[:blank:]]*)(.*))?$"
# |-------------| |--|
# First Name Last Name
# Clean the 'dirty_data' and store it in a fresh table: 'clean_data'.
clean_data <- dirty_data %>%
mutate(
# Remove external whitespace for easier analysis.
clean_full_name = str_trim(dirty_name),
# Break the dirty names (using regex) into a matrix of their components.
name_components = str_match(dirty_name, dirty_regex),
# Extract each component.
clean_prefix = name_components[, 2],
clean_first_name = name_components[, 6],
clean_last_name = name_components[, 9],
# Remove the matrix.
name_components = NULL,
# Trim any external whitespace in the (new) components.
across(starts_with("clean_") & !clean_full_name, str_trim),
# Replace any empty strings ("") with blanks (NAs).
across(starts_with("clean_"), na_if, y = "")
)
# Print and inspect our result.
clean_data
Result
Given data like your dirty_data
below
# The dirty names.
dirty_names_vec <- c("Tom", "Tom Turner", "Dr. Tom Turner", "R. Tom Turner", "J Tom Turner", "Jr. Tom Turner")
# A table with a column for the dirty names.
dirty_data <- tibble(dirty_name = dirty_names_vec)
this workflow should yield the following result for clean_data
:
# A tibble: 6 × 5
dirty_name clean_full_name clean_prefix clean_first_name clean_last_name
<chr> <chr> <chr> <chr> <chr>
1 Tom Tom NA Tom NA
2 Tom Turner Tom Turner NA Tom Turner
3 Dr. Tom Turner Dr. Tom Turner Dr. Tom Turner
4 R. Tom Turner R. Tom Turner R. Tom Turner
5 J Tom Turner J Tom Turner J Tom Turner
6 Jr. Tom Turner Jr. Tom Turner Jr. Tom Turner
Note
If other "dirty" names are in different formats, you must modify your dirty_regex
accordingly. You should likewise adjust the index i
of each capture group, used to extract the components via clean_* = name_components[, i]
.
See str_match()
from the stringr
package, for extracting components in "capture groups". For further information on defining those groups, see regular expressions with stringr
.