Home > database >  splitting names by & sign and duplicating rows at the same time
splitting names by & sign and duplicating rows at the same time

Time:03-29

I have names from a property database mostly organized by "LastName FirstName MI", but some have husband and wife names and go "LastName FirstName & FirstName", and I want to split that into two rows. An example I found here get's me started, but I can't figure out how to get it to work, and it's also downvoted into the negatives.

This is the data and where I'm at:

Id = c("id1", "id2", "id3", "id4", "id5", "id6", "id7")
Name = c("Berry Marry & Paul", "Horrowitz Anthony", "Lawrence Jennifer & Chris", "Jones John", "Rover Red & Clifford", "Jagger Mick", "Arthur King & Gweniverre")

df = data.frame(Id, Name)

cbind(df,t(sapply(str_split(df$Name, " & "), unlist))) %>% 
  gather(newProd, values,`1`,`2`) %>% 
  arrange(key, values)

This is what I want:

    Id              Name
1  id1       Berry Marry
2  id1        Berry Paul
3  id2 Horrowitz Anthony
4  id3 Lawrence Jennifer
5  id3    Lawrence Chris
6  id4        Jones John
7  id5         Rover Red
8  id5    Rover Clifford
9  id6       Jagger Mick
10 id7       Arthur King
11 id7 Arthur Gweniverre

Advice always appreciated, thanks!

CodePudding user response:

Here is a readable way to do it. You can then drop the LastName and FirstName columns, but I left them in to show how it works.

library(dplyr)
library(tidyr)
library(stringr)

df %>% 
  mutate(LastName = word(Name, 1),
         FirstName = word(Name, 2, -1)) %>% 
  separate_rows(FirstName, sep = " & ") %>% 
  mutate(Name = paste(LastName, FirstName))

Which gives:

# A tibble: 11 x 4
   Id    Name              LastName  FirstName 
   <chr> <chr>             <chr>     <chr>     
 1 id1   Berry Marry       Berry     Marry     
 2 id1   Berry Paul        Berry     Paul      
 3 id2   Horrowitz Anthony Horrowitz Anthony   
 4 id3   Lawrence Jennifer Lawrence  Jennifer  
 5 id3   Lawrence Chris    Lawrence  Chris     
 6 id4   Jones John        Jones     John      
 7 id5   Rover Red         Rover     Red       
 8 id5   Rover Clifford    Rover     Clifford  
 9 id6   Jagger Mick       Jagger    Mick      
10 id7   Arthur King       Arthur    King      
11 id7   Arthur Gweniverre Arthur    Gweniverre

CodePudding user response:

How about this:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
#> Warning: package 'tidyr' was built under R version 4.1.2
library(stringr)
Id = c("id1", "id2", "id3", "id4", "id5", "id6", "id7")
Name = c("Berry Marry & Paul", "Horrowitz Anthony A", "Lawrence Jennifer & Chris", "Jones John", "Rover Red & Clifford", "Jagger Mick", "Arthur King & Gweniverre")

df = data.frame(Id, Name)

df %>% 
  mutate(Name = gsub(" & ", "&", Name, fixed=TRUE), 
         Name = gsub("(.*)\\s([A-Z])$", "\\1_\\2", Name)) %>% 
  separate(Name, "\\s", into = c("last", "first")) %>% 
  mutate(first = str_split(first, "&")) %>% 
  unnest(first) %>% 
  mutate(first = gsub("_", " ", first))
#> # A tibble: 11 × 3
#>    Id    last      first     
#>    <chr> <chr>     <chr>     
#>  1 id1   Berry     Marry     
#>  2 id1   Berry     Paul      
#>  3 id2   Horrowitz Anthony A 
#>  4 id3   Lawrence  Jennifer  
#>  5 id3   Lawrence  Chris     
#>  6 id4   Jones     John      
#>  7 id5   Rover     Red       
#>  8 id5   Rover     Clifford  
#>  9 id6   Jagger    Mick      
#> 10 id7   Arthur    King      
#> 11 id7   Arthur    Gweniverre

The code replaces " & " with "&", so the only spaces left are between the last name and the first, first name. Then it splits the last name from the first names. Finally, it splits the first names into a list and then uses unnest to turn each element into a row.

Created on 2022-03-28 by the reprex package (v2.0.1)

CodePudding user response:

Using a by approach similar to this one.

do.call(rbind, 
        c(by(df, df$Id, \(x) 
             with(x, data.frame(Id, Name={
               if (grepl('&', x$Name)) {
               k <- el(strsplit(x$Name, ' '))
               paste(k[1], k[c(2, 4)])
               } else x$Name
             }))),
          make.row.names=FALSE))
#     Id              Name
# 1  id1       Berry Marry
# 2  id1        Berry Paul
# 3  id2 Horrowitz Anthony
# 4  id3 Lawrence Jennifer
# 5  id3    Lawrence Chris
# 6  id4        Jones John
# 7  id5         Rover Red
# 8  id5    Rover Clifford
# 9  id6       Jagger Mick
# 10 id7       Arthur King
# 11 id7 Arthur Gweniverre

Note: R >= 4.1 used.


Data:

df <- structure(list(Id = c("id1", "id2", "id3", "id4", "id5", "id6", 
"id7"), Name = c("Berry Marry & Paul", "Horrowitz Anthony", "Lawrence Jennifer & Chris", 
"Jones John", "Rover Red & Clifford", "Jagger Mick", "Arthur King & Gweniverre"
)), class = "data.frame", row.names = c(NA, -7L))

CodePudding user response:

We may use separate_rows to split at the & delimiter, the do a group_by paste

library(dplyr)
library(stringr)
library(tidyr)
df %>% 
   separate_rows(Name, sep = "\\s \\&\\s ") %>% 
   group_by(Id) %>% 
    mutate(Name = replace(Name, -1,
       str_c(word(first(Name), 1), Name[-1], sep = " "))) %>%
   ungroup

-output

# A tibble: 11 × 2
   Id    Name             
   <chr> <chr>            
 1 id1   Berry Marry      
 2 id1   Berry Paul       
 3 id2   Horrowitz Anthony
 4 id3   Lawrence Jennifer
 5 id3   Lawrence Chris   
 6 id4   Jones John       
 7 id5   Rover Red        
 8 id5   Rover Clifford   
 9 id6   Jagger Mick      
10 id7   Arthur King      
11 id7   Arthur Gweniverre

Or slightly more compact

df %>% 
   mutate(Name = str_replace(Name, "^(\\w )([^&] )&", "\\1\\2,\\1")) %>% 
   separate_rows(Name, sep = "\\s ,\\s*")
# A tibble: 11 × 2
   Id    Name             
   <chr> <chr>            
 1 id1   Berry Marry      
 2 id1   Berry Paul       
 3 id2   Horrowitz Anthony
 4 id3   Lawrence Jennifer
 5 id3   Lawrence Chris   
 6 id4   Jones John       
 7 id5   Rover Red        
 8 id5   Rover Clifford   
 9 id6   Jagger Mick      
10 id7   Arthur King      
11 id7   Arthur Gweniverre
  •  Tags:  
  • r
  • Related