Home > Software engineering >  What is the quickest way to create new column of all unique last names for a given first name in Bas
What is the quickest way to create new column of all unique last names for a given first name in Bas

Time:06-14

I have a dataset similar to the following (but larger):

dataset <- data.frame(First = c("John","John","Andy","John"), Last = c("Lewis","Brown","Alphie","Johnson"))

I would like to create a new column that contains each unique last name cooresponding to the given first name. Thus, each observation of "John" would have c("Lewis", "Brown", "Johnson") in the third column.

I'm a bit perplexed because my attempts at vectorization seem impossible given I can't reference the particular observation I'm looking at. Specifically, what I want to write is:

dataset$allLastNames <- unique(data$Last[data$First == "the current index???"])

I think this can work in a loop (since I reference the observation with 'i'), but it is taking too long given the size of my data:

for(i in 1:nrow(dataset)){
dataset$allLastNames[i] <- unique(dataset$Last[dataset$First == dataset$First[i]])
}

Any suggestions for how I could make this work (using Base R)?

Thanks!

CodePudding user response:

You can use dplyr library with a few lines. First, you can group by first names and list all unique last names occurences.

library(dplyr)

list_names = dataset %>% 
  group_by(First) %>% 
  summarise(allLastNames = list(unique(Last)))

Then, add the summary table to your dataset matching the First names:

dataset %>% left_join(list_names,by='First')

  First    Last          allLastNames
1  John   Lewis Lewis, Brown, Johnson
2  John   Brown Lewis, Brown, Johnson
3  Andy  Alphie                Alphie
4  John Johnson Lewis, Brown, Johnson

Also, I think R is a good language to avoid using for-loops. You have several methods to work with dataset and arrays avoiding them.

CodePudding user response:

Base R option:

allLastNames <- aggregate(.~First, dataset, paste, collapse = ",")
dataset <- merge(dataset, allLastNames, by = "First")
names(dataset) <- c("First", "Last", "allLastNames")

Output:

  First    Last        allLastNames
1  Andy  Alphie              Alphie
2  John   Lewis Lewis,Brown,Johnson
3  John   Brown Lewis,Brown,Johnson
4  John Johnson Lewis,Brown,Johnson

CodePudding user response:

library(dplyr) 
library(stringr)

dataset %>%  
  group_by(First) %>%  
  mutate(Lastnames = str_flatten(Last, ', ')) 

# Groups:   First [2]
  First Last    Lastnames            
  <chr> <chr>   <chr>                
1 John  Lewis   Lewis, Brown, Johnson
2 John  Brown   Lewis, Brown, Johnson
3 Andy  Alphie  Alphie               
4 John  Johnson Lewis, Brown, Johnson
  •  Tags:  
  • r
  • Related