How to remove only words that end with period with Regex?

Time:10-07

I am trying to remove suffixes from a list of last names using regex:

library(stringr)

names <- c("John max Jr.", "manuel cortez", "samuel III", "Jameson")
lapply(names, function(x) str_extract(x, ".*[^\\s.*\\.$]"))

Output:

[[1]]
[1] "John max Jr"

[[2]]
[1] "manuel cortez"

[[3]]
[1] "samuel III"

[[4]]
[1] "Jameson"

What I am currently doing does not work. I am trying to remove every word that ends with a period. If you could help me solve this and explain the solution, I would greatly appreciate it. I also need to remove Roman numerals, but hopefully I can figure that out after learning how to remove words ending in a period.

Desired Output:

John max
manuel cortez
samuel
Jameson

Updated to remove Roman Numerals:

lapply(names, function(x) str_extract(x, ".*[^(\\s.*\\.$)|(\\sI{2} )]"))

CodePudding user response:

If we just want to remove something, maybe str_remove() is better:

library(stringr)

lapply(names, function(x) str_remove(x, "\\w+\\.$")) |>
    trimws()

"John max"      "manuel cortez" "samuel III"    "Jameson"     

CodePudding user response:

You can use

library(stringr)
names <- c("John max Jr.", "manuel cortez", "samuel III", "Jameson")
str_replace_all(names, "\\s*\\p{L}+\\.", "")
# str_remove_all(names, "\\s*\\p{L}+\\.")
# Or,
str_replace_all(names, "\\s*\\w+\\.", "")
# str_remove_all(names, "\\s*\\w+\\.")


Details:

  • \s* - zero or more whitespace characters
  • \p{L}+ - one or more Unicode letters
  • \w+ - one or more letters, digits, or connector punctuation characters (such as the underscore)
  • \. - a literal dot
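
Neither answer above removes the Roman numeral in "samuel III", which the question's desired output also drops. A minimal base R sketch might chain two substitutions, under these assumptions: suffix numerals are uppercase, use only the letters I, V, and X, and are at least two characters long (so a single trailing "I" or "V" is left alone). The helper name strip_suffix is hypothetical and does not come from the original answers.

```r
# Base R only, so no packages are required.
names <- c("John max Jr.", "manuel cortez", "samuel III", "Jameson")

# Hypothetical helper: strips one trailing suffix of each kind.
strip_suffix <- function(x) {
  # Drop a trailing word that ends in a period, e.g. " Jr."
  x <- sub("\\s*\\w+\\.$", "", x, perl = TRUE)
  # Drop a trailing run of two or more of I, V, X, e.g. " III" or " IV".
  # Requiring leading whitespace keeps single-word names intact.
  x <- sub("\\s+[IVX]{2,}$", "", x, perl = TRUE)
  trimws(x)
}

strip_suffix(names)
# "John max" "manuel cortez" "samuel" "Jameson"
```

The numeral pattern is deliberately loose: it does not check that the letters form a valid Roman numeral, and it would need extending (for example with L, C, D, M) for larger values.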