I have a dataset with over 600k obs in R. I created a "key" variable and I want to know how many different keys appear related to an ID variable. My dataset looks something like this:
id <- c("58497484", "58544005", "74766653", "74764718", "62824455", "58497484", "58497484")
key <- c("5718368_09/06/1981_3_2014", "2077485_02/06/1977_8_2014", "2091585_23/10/1982_1_2014", "2077388_30/01/2000_11_2017", "2082225_02/07/1998_10_2017", "2077450_04/05/2001_1_2016", "2077477_03/03/1978_8_2017")
data <- as.data.frame(cbind(id, key))
So, as you can see, the id "58497484" repeats itself 3 times, but shows 3 different keys. What I want is to list each id and it's respective key in a table, so it would look something like this:
id key
58497484 5718368_09/06/1981_3_2014
2077450_04/05/2001_1_2016
2077477_03/03/1978_8_2017
58544005 2077485_02/06/1977_8_2014
74766653 2091585_23/10/1982_1_2014
74764718 2077388_30/01/2000_11_2017
62824455 2082225_02/07/1998_10_2017
Any ideas how can I make this work? I've already tried dplyr and group_by using the two variables, but I don't get the desired output.
CodePudding user response:
We could first arrange and then replace the duplicates by blank cells:
library(dplyr)
data %>%
arrange(id) %>%
mutate(id = ifelse(duplicated(id), "", id))
id key
1 58497484 5718368_09/06/1981_3_2014
2 2077450_04/05/2001_1_2016
3 2077477_03/03/1978_8_2017
4 58544005 2077485_02/06/1977_8_2014
5 62824455 2082225_02/07/1998_10_2017
6 74764718 2077388_30/01/2000_11_2017
7 74766653 2091585_23/10/1982_1_2014
CodePudding user response:
Here is a base R solution too.
data <- data[order(id, as.integer(sub('.*(\\d{4}).*', '\\1', key))),]
data$id[duplicated(data$id) == TRUE] <- ""
# Reset row names/numbers.
rownames(data) <- NULL
Output
id key
1 58497484 5718368_09/06/1981_3_2014
2 2077450_04/05/2001_1_2016
3 2077477_03/03/1978_8_2017
4 58544005 2077485_02/06/1977_8_2014
5 62824455 2082225_02/07/1998_10_2017
6 74764718 2077388_30/01/2000_11_2017
7 74766653 2091585_23/10/1982_1_2014