I have a dataset similar to the following (but larger):
dataset <- data.frame(First = c("John","John","Andy","John"), Last = c("Lewis","Brown","Alphie","Johnson"))
I would like to create a new column that contains each unique last name cooresponding to the given first name. Thus, each observation of "John" would have c("Lewis", "Brown", "Johnson") in the third column.
I'm a bit perplexed because my attempts at vectorization seem impossible given I can't reference the particular observation I'm looking at. Specifically, what I want to write is:
dataset$allLastNames <- unique(data$Last[data$First == "the current index???"])
I think this can work in a loop (since I reference the observation with 'i'), but it is taking too long given the size of my data:
for(i in 1:nrow(dataset)){
dataset$allLastNames[i] <- unique(dataset$Last[dataset$First == dataset$First[i]])
}
Any suggestions for how I could make this work (using Base R)?
Thanks!
CodePudding user response:
You can use dplyr
library with a few lines. First, you can group by first names and list all unique last names occurences.
library(dplyr)
list_names = dataset %>%
group_by(First) %>%
summarise(allLastNames = list(unique(Last)))
Then, add the summary table to your dataset matching the First names:
dataset %>% left_join(list_names,by='First')
First Last allLastNames
1 John Lewis Lewis, Brown, Johnson
2 John Brown Lewis, Brown, Johnson
3 Andy Alphie Alphie
4 John Johnson Lewis, Brown, Johnson
Also, I think R is a good language to avoid using for-loops. You have several methods to work with dataset and arrays avoiding them.
CodePudding user response:
Base R
option:
allLastNames <- aggregate(.~First, dataset, paste, collapse = ",")
dataset <- merge(dataset, allLastNames, by = "First")
names(dataset) <- c("First", "Last", "allLastNames")
Output:
First Last allLastNames
1 Andy Alphie Alphie
2 John Lewis Lewis,Brown,Johnson
3 John Brown Lewis,Brown,Johnson
4 John Johnson Lewis,Brown,Johnson
CodePudding user response:
library(dplyr)
library(stringr)
dataset %>%
group_by(First) %>%
mutate(Lastnames = str_flatten(Last, ', '))
# Groups: First [2]
First Last Lastnames
<chr> <chr> <chr>
1 John Lewis Lewis, Brown, Johnson
2 John Brown Lewis, Brown, Johnson
3 Andy Alphie Alphie
4 John Johnson Lewis, Brown, Johnson