I am trying to remove suffixes from a list of last names using regex:
names <- c("John max Jr.", "manuel cortez", "samuel III", "Jameson")
lapply(names, function(x) str_extract(x, ".*[^\\s.*\\.$]"))
Output:
[[1]]
[1] "John max Jr"
[[2]]
[1] "manuel cortez"
[[3]]
[1] "samuel III"
[[4]]
[1] "Jameson"
What I am currently doing does not work: it only strips the trailing period instead of the whole suffix. I was trying to remove every word that ends with a period. If you could help me solve this and explain why, it would be greatly appreciated. I also need to remove Roman numerals, but hopefully I can figure that out after learning how to remove words ending in a period.
Desired Output:
John max
manuel cortez
samuel
Jameson
Updated to remove Roman Numerals:
lapply(names, function(x) str_extract(x, ".*[^(\\s.*\\.$)|(\\sI{2} )]"))
CodePudding user response:
If we just want to remove something, maybe str_remove() is better:
library(stringr)
lapply(names, function(x) str_remove(x, "\\w+\\.$")) |>
  trimws()
[1] "John max"      "manuel cortez" "samuel III"    "Jameson"
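The call above still leaves "samuel III" untouched. One possible extension is an alternation anchored at the end of the string. This is only a sketch: it assumes suffix numerals are written in uppercase and use only I, V and X, and the variable names `suffix` and `cleaned` are made up for illustration.

```r
library(stringr)

names <- c("John max Jr.", "manuel cortez", "samuel III", "Jameson")

# Assumption: a suffix is the last word of the string and is either
#   - a word ending in a period, e.g. "Jr.", or
#   - a run of uppercase I/V/X, e.g. "III".
# The leading \\s+ requires whitespace before the suffix, so a
# single-word name such as "Jameson" is never touched.
suffix <- "\\s+(\\w+\\.|[IVX]+)$"

cleaned <- str_remove(names, suffix)
print(cleaned)
# [1] "John max"      "manuel cortez" "samuel"        "Jameson"
```

Note that this would also strip a genuine last name made up only of those letters (for example "Malcolm X"), so the character set may need tightening for real data.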
CodePudding user response:
You can use
library(stringr)
names <- c("John max Jr.", "manuel cortez", "samuel III", "Jameson")
str_replace_all(names, "\\s*\\p{L}+\\.", "")
# str_remove_all(names, "\\s*\\p{L}+\\.")
# Or,
str_replace_all(names, "\\s*\\w+\\.", "")
# str_remove_all(names, "\\s*\\w+\\.")
Details:
\s* - zero or more whitespaces
\p{L}+ - one or more letters
\w+ - one or more letters, digits, and connector punctuation
\. - a dot.
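As for why the attempt in the question only dropped the final period: square brackets define a set of single characters, not a sub-pattern, so `[^...]` matches exactly one character that is not in the set. The greedy `.*` in front of it therefore runs as far as the last character that is allowed, which is the "r" in "Jr". A small side-by-side sketch, where `attempt` and `fixed` are names chosen here for illustration:

```r
library(stringr)

names <- c("John max Jr.", "manuel cortez", "samuel III", "Jameson")

# Pattern from the question: greedy `.*` followed by ONE character that is
# not in the bracketed set, so only the trailing "." falls outside the match.
attempt <- str_extract(names, ".*[^\\s.*\\.$]")
print(attempt)
# [1] "John max Jr"   "manuel cortez" "samuel III"    "Jameson"

# Same idea written as a sequence outside brackets: whitespace, a word,
# a literal dot, then the end of the string. Removing that match drops
# the whole suffix instead of just the period.
fixed <- str_remove(names, "\\s+\\w+\\.$")
print(fixed)
# [1] "John max"      "manuel cortez" "samuel III"    "Jameson"
```

Anchoring with `$` limits the removal to a suffix at the very end, whereas the unanchored patterns above also remove a dotted word in the middle of a name.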