I need to extract the last names of several thousand people. Each name is two or three words long, depending on whether it has a suffix. My approach is to count the words in each row, then run a different separate() call depending on the count. The following code does not work, but it shows my thinking:
customers = data.frame(names=c("Jack Quinn III", "David Powell", "Carrie Green",
"Steven Miller, Jr.", "Christine Powers", "Amanda Ramirez"))
customers |>
mutate(names_count = str_count(names, "\\w+")) |>
{
if(names_count == 2,
separate(name, c("first_name", "last_name") ),
separate(name, c("first_name", "last_name", "suffix") )
)
}
I can't get this code to work, and I don't know how to interpret the error messages. I'm not even sure whether the commas belong in the if statement, because there are apparently functions that use both forms.
My thought was that I could get the names split into columns by doing
df |>
mutate() to count words |>
separate() to split columns based on count
but I can't get even the simplest if statement to work.
CodePudding user response:
We could use word() from stringr instead:
library(stringr)
library(dplyr)
customers |>
mutate(last_name = word(names, 2))
Output:
names last_name
1 Jack Quinn III Quinn
2 David Powell Powell
3 Carrie Green Green
4 Steven Miller, Jr. Miller,
5 Christine Powers Powers
6 Amanda Ramirez Ramirez
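Note that row 4 keeps the trailing comma ("Miller,"). A minimal sketch of one way to drop it, assuming a comma at the end of the second word is the only punctuation that needs cleaning; the result variable is just an illustrative name:

```r
library(stringr)
library(dplyr)

customers <- data.frame(names = c("Jack Quinn III", "David Powell", "Carrie Green",
                                  "Steven Miller, Jr.", "Christine Powers", "Amanda Ramirez"))

result <- customers |>
  # take the second word, then strip a trailing comma if one is present
  mutate(last_name = str_remove(word(names, 2), ",$"))

result$last_name
```

With the sample data this should leave "Miller" without the comma, while the other rows are unchanged.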
CodePudding user response:
Using str_extract
library(dplyr)
library(stringr)
customers %>%
mutate(last_name = str_extract(names, "^[A-Za-z]+\\s+([A-Za-z]+)", group = 1))
-output
names last_name
1 Jack Quinn III Quinn
2 David Powell Powell
3 Carrie Green Green
4 Steven Miller, Jr. Miller
5 Christine Powers Powers
6 Amanda Ramirez Ramirez
CodePudding user response:
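You can drop the if entirely and always split into three columns; names without a suffix get NA in the suffix column:
library(tidyr)
library(dplyr)
customers %>%
# fill = "right" pads two-word names with NA instead of warning
separate(names, into = c("first_name", "last_name", "suffix"), sep=" ", fill = "right") %>%
select(last_name)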
If you want to avoid extra packages, you can use base R's sub() with a regex:
> sub("[A-Za-z]+\\s+([A-Za-z]+)\\s?.*", "\\1", customers$names)
[1] "Quinn" "Powell" "Green" "Miller" "Powers" "Ramirez"
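As for the if statement in the question: if () {} else {} takes no commas and tests a single value, so it cannot branch row by row inside a pipeline. The comma-separated form belongs to vectorised functions such as ifelse() or dplyr's if_else(). A rough sketch of the count-then-branch idea using if_else(), assuming every name is "first last" or "first last suffix"; the result variable is an illustrative name:

```r
library(dplyr)
library(stringr)

customers <- data.frame(names = c("Jack Quinn III", "David Powell", "Carrie Green",
                                  "Steven Miller, Jr.", "Christine Powers", "Amanda Ramirez"))

result <- customers |>
  mutate(
    # number of words in each name (2 without a suffix, 3 with one)
    names_count = str_count(names, "\\w+"),
    # vectorised condition: if_else(condition, value_if_true, value_if_false)
    last_name = if_else(names_count == 2, word(names, -1), word(names, 2)),
    # drop the comma left over from names like "Miller, Jr."
    last_name = str_remove(last_name, ",$")
  )

result$last_name
```

This is more machinery than the one-liners above need, since the second word is the last name in both cases, but it shows where the commas go.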