I have names from a property database mostly organized by "LastName FirstName MI", but some have husband and wife names and go "LastName FirstName & FirstName", and I want to split that into two rows. An example I found here get's me started, but I can't figure out how to get it to work, and it's also downvoted into the negatives.
This is the data and where I'm at:
Id = c("id1", "id2", "id3", "id4", "id5", "id6", "id7")
Name = c("Berry Marry & Paul", "Horrowitz Anthony", "Lawrence Jennifer & Chris", "Jones John", "Rover Red & Clifford", "Jagger Mick", "Arthur King & Gweniverre")
df = data.frame(Id, Name)
cbind(df,t(sapply(str_split(df$Name, " & "), unlist))) %>%
gather(newProd, values,`1`,`2`) %>%
arrange(key, values)
This is what I want:
Id Name
1 id1 Berry Marry
2 id1 Berry Paul
3 id2 Horrowitz Anthony
4 id3 Lawrence Jennifer
5 id3 Lawrence Chris
6 id4 Jones John
7 id5 Rover Red
8 id5 Rover Clifford
9 id6 Jagger Mick
10 id7 Arthur King
11 id7 Arthur Gweniverre
Advice always appreciated, thanks!
CodePudding user response:
Here is a readable way to do it. You can then drop the LastName
and FirstName
columns, but I left them in to show how it works.
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(LastName = word(Name, 1),
FirstName = word(Name, 2, -1)) %>%
separate_rows(FirstName, sep = " & ") %>%
mutate(Name = paste(LastName, FirstName))
Which gives:
# A tibble: 11 x 4
Id Name LastName FirstName
<chr> <chr> <chr> <chr>
1 id1 Berry Marry Berry Marry
2 id1 Berry Paul Berry Paul
3 id2 Horrowitz Anthony Horrowitz Anthony
4 id3 Lawrence Jennifer Lawrence Jennifer
5 id3 Lawrence Chris Lawrence Chris
6 id4 Jones John Jones John
7 id5 Rover Red Rover Red
8 id5 Rover Clifford Rover Clifford
9 id6 Jagger Mick Jagger Mick
10 id7 Arthur King Arthur King
11 id7 Arthur Gweniverre Arthur Gweniverre
CodePudding user response:
How about this:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
#> Warning: package 'tidyr' was built under R version 4.1.2
library(stringr)
Id = c("id1", "id2", "id3", "id4", "id5", "id6", "id7")
Name = c("Berry Marry & Paul", "Horrowitz Anthony A", "Lawrence Jennifer & Chris", "Jones John", "Rover Red & Clifford", "Jagger Mick", "Arthur King & Gweniverre")
df = data.frame(Id, Name)
df %>%
mutate(Name = gsub(" & ", "&", Name, fixed=TRUE),
Name = gsub("(.*)\\s([A-Z])$", "\\1_\\2", Name)) %>%
separate(Name, "\\s", into = c("last", "first")) %>%
mutate(first = str_split(first, "&")) %>%
unnest(first) %>%
mutate(first = gsub("_", " ", first))
#> # A tibble: 11 × 3
#> Id last first
#> <chr> <chr> <chr>
#> 1 id1 Berry Marry
#> 2 id1 Berry Paul
#> 3 id2 Horrowitz Anthony A
#> 4 id3 Lawrence Jennifer
#> 5 id3 Lawrence Chris
#> 6 id4 Jones John
#> 7 id5 Rover Red
#> 8 id5 Rover Clifford
#> 9 id6 Jagger Mick
#> 10 id7 Arthur King
#> 11 id7 Arthur Gweniverre
The code replaces " & "
with "&"
, so the only spaces left are between the last name and the first, first name. Then it splits the last name from the first names. Finally, it splits the first names into a list and then uses unnest
to turn each element into a row.
Created on 2022-03-28 by the reprex package (v2.0.1)
CodePudding user response:
Using a by
approach similar to this one.
do.call(rbind,
c(by(df, df$Id, \(x)
with(x, data.frame(Id, Name={
if (grepl('&', x$Name)) {
k <- el(strsplit(x$Name, ' '))
paste(k[1], k[c(2, 4)])
} else x$Name
}))),
make.row.names=FALSE))
# Id Name
# 1 id1 Berry Marry
# 2 id1 Berry Paul
# 3 id2 Horrowitz Anthony
# 4 id3 Lawrence Jennifer
# 5 id3 Lawrence Chris
# 6 id4 Jones John
# 7 id5 Rover Red
# 8 id5 Rover Clifford
# 9 id6 Jagger Mick
# 10 id7 Arthur King
# 11 id7 Arthur Gweniverre
Note: R >= 4.1 used.
Data:
df <- structure(list(Id = c("id1", "id2", "id3", "id4", "id5", "id6",
"id7"), Name = c("Berry Marry & Paul", "Horrowitz Anthony", "Lawrence Jennifer & Chris",
"Jones John", "Rover Red & Clifford", "Jagger Mick", "Arthur King & Gweniverre"
)), class = "data.frame", row.names = c(NA, -7L))
CodePudding user response:
We may use separate_rows
to split at the &
delimiter, the do a group_by
paste
library(dplyr)
library(stringr)
library(tidyr)
df %>%
separate_rows(Name, sep = "\\s \\&\\s ") %>%
group_by(Id) %>%
mutate(Name = replace(Name, -1,
str_c(word(first(Name), 1), Name[-1], sep = " "))) %>%
ungroup
-output
# A tibble: 11 × 2
Id Name
<chr> <chr>
1 id1 Berry Marry
2 id1 Berry Paul
3 id2 Horrowitz Anthony
4 id3 Lawrence Jennifer
5 id3 Lawrence Chris
6 id4 Jones John
7 id5 Rover Red
8 id5 Rover Clifford
9 id6 Jagger Mick
10 id7 Arthur King
11 id7 Arthur Gweniverre
Or slightly more compact
df %>%
mutate(Name = str_replace(Name, "^(\\w )([^&] )&", "\\1\\2,\\1")) %>%
separate_rows(Name, sep = "\\s ,\\s*")
# A tibble: 11 × 2
Id Name
<chr> <chr>
1 id1 Berry Marry
2 id1 Berry Paul
3 id2 Horrowitz Anthony
4 id3 Lawrence Jennifer
5 id3 Lawrence Chris
6 id4 Jones John
7 id5 Rover Red
8 id5 Rover Clifford
9 id6 Jagger Mick
10 id7 Arthur King
11 id7 Arthur Gweniverre