I have a dataframe with ages and my objective is to split those ages into groups.
I found a way to do it using cut from base R. However, I would like to change the format of the output. I saw one way in this post but, as someone there points out, it loses information contained in the original notation.
The code:
set.seed(1234)
Age <- floor(runif(25, min=18, max=75))
df <- data.frame(Age)
mybreaks <- seq(min(df$Age) - 1, to = max(df$Age) + 10, by = 10)
df$groups_age <- cut(df$Age, breaks = mybreaks, by=10)
label_interval <- function(breaks) {
  paste0("(", breaks[1:length(breaks) - 1], "-", breaks[2:length(breaks)], ")")
}
df$groups_age_2 <- cut(df$Age, breaks = mybreaks, labels = label_interval(mybreaks))
df[7:12, ]
# Age groups_age groups_age_2
# 7 18 [17,27) (17-27)
# 8 31 [27,37) (27-37)
# 9 55 [47,57) (47-57)
# 10 47 [47,57) (37-47)
# 11 57 [57,67) (47-57)
# 12 49 [47,57) (47-57)
As you can see, the column groups_age is the output from cut and the column groups_age_2 is the result of using the function label_interval (which someone wrote in the previous post).
It could be a good solution, but I lose information.
For example, rows 7 and 8 hold two different ages that fall into different groups, yet both labels, (17-27) and (27-37), appear to contain 27, which is not correct.
I would like row 8 to read (28-37), but I don't know how to add this to the function so that it applies to the entire dataframe.
Does anyone know how to do it?
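To make the ambiguity concrete, here is a minimal sketch (the three-element mybreaks vector is a shortened, illustrative version of the one above). It shows which bin cut assigns to the boundary value 27 under each setting of its right argument; only the bracket style of the default labels records that choice, which is the information a plain (17-27) label drops.

```r
mybreaks <- c(17, 27, 37)  # shortened breaks, for illustration only

# Default right = TRUE: intervals are closed on the right, so 27 stays in the first bin.
right_closed <- as.character(cut(c(27, 28), breaks = mybreaks))
right_closed
# "(17,27]" "(27,37]"

# right = FALSE: intervals are closed on the left, so 27 moves to the second bin.
left_closed <- as.character(cut(c(27, 28), breaks = mybreaks, right = FALSE))
left_closed
# "[27,37)" "[27,37)"
```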
CodePudding user response:
Try redefining your function like this. It uses sprintf instead of paste, which might be easier when hardcoded and dynamic strings are concatenated.
label_interval <- function(breaks) {
  # lower bound is shifted by 1 so that adjacent labels no longer share a value
  do.call(\(...) sprintf('(%s-%s)', ...),
          cbind.data.frame(breaks[-length(breaks)] + 1, breaks[-1]))
}
df$groups_age_2 <- cut(df$Age, breaks=mybreaks, labels=label_interval(mybreaks))
df[order(df$Age), ]
# Age groups_age groups_age_2
# 7 18 (17,27] (18-27)
# 24 20 (17,27] (18-27)
# 1 24 (17,27] (18-27)
# 23 27 (17,27] (18-27)
# 19 28 (27,37] (28-37)
# 25 30 (27,37] (28-37)
# 8 31 (27,37] (28-37)
# 20 31 (27,37] (28-37)
# 18 33 (27,37] (28-37)
# 13 34 (27,37] (28-37)
# 15 34 (27,37] (28-37)
# 17 34 (27,37] (28-37)
# 22 35 (27,37] (28-37)
# 21 36 (27,37] (28-37)
# 10 47 (37,47] (38-47)
# 12 49 (47,57] (48-57)
# 3 52 (47,57] (48-57)
# 2 53 (47,57] (48-57)
# 4 53 (47,57] (48-57)
# 6 54 (47,57] (48-57)
# 9 55 (47,57] (48-57)
# 11 57 (47,57] (48-57)
# 16 65 (57,67] (58-67)
# 5 67 (57,67] (58-67)
# 14 70 (67,77] (68-77)
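As a quick sanity check on the relabelling, the sketch below rebuilds the same style of labels without do.call and confirms that every age lies inside the bounds printed in its own label. It assumes whole-number ages and the default right-closed bins; the names ages, lower, upper, and all_inside are introduced here for illustration only.

```r
# Ages and breaks copied from the example data in this thread.
ages <- c(24, 53, 52, 53, 67, 54, 18, 31, 55, 47, 57, 49, 34,
          70, 34, 65, 34, 33, 28, 31, 36, 35, 27, 20, 30)
mybreaks <- seq(min(ages) - 1, to = max(ages) + 10, by = 10)

lower <- mybreaks[-length(mybreaks)] + 1  # first whole number inside each bin
upper <- mybreaks[-1]                     # last whole number inside each bin
labels <- sprintf("(%s-%s)", lower, upper)

# Bin index of each age, using cut's default right-closed intervals.
idx <- as.integer(cut(ages, breaks = mybreaks))

# TRUE when every age falls within the bounds shown in its label.
all_inside <- all(ages >= lower[idx] & ages <= upper[idx])
all_inside
```

The + 1 shift is only faithful for integer data: a non-integer age such as 27.5 would be binned into (27,37] by cut but would sit outside the printed range (28-37).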
Data:
df <- structure(list(Age = c(24, 53, 52, 53, 67, 54, 18, 31, 55, 47,
57, 49, 34, 70, 34, 65, 34, 33, 28, 31, 36, 35, 27, 20, 30),
groups_age = structure(c(1L, 4L, 4L, 4L, 5L, 4L, 1L, 2L,
4L, 3L, 4L, 4L, 2L, 6L, 2L, 5L, 2L, 2L, 2L, 2L, 2L, 2L, 1L,
1L, 2L), .Label = c("(17,27]", "(27,37]", "(37,47]", "(47,57]",
"(57,67]", "(67,77]"), class = "factor")), row.names = c(NA,
-25L), class = "data.frame")
mybreaks <- seq(min(df$Age) - 1, to = max(df$Age) + 10, by = 10)
CodePudding user response:
cut calls .bincode, which returns the groups as integers, where 1 is the lowest group, then 2, etc. So you could do the formatting manually and then map the labels to the group index, e.g.:
format_groups <- sapply(1:(length(mybreaks) - 1), function(i) {
  # to avoid adding 1 to the first cluster:
  if (i > 1) {
    b1 <- mybreaks[i] + 1
  } else {
    b1 <- mybreaks[i]
  }
  sprintf("(%s-%s)", b1, mybreaks[i + 1])
})
df$group_index <- .bincode(df$Age, breaks = mybreaks)
df$groups_age_2 <- format_groups[df$group_index]
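A small sketch of how this mapping could be checked, using a handful of illustrative ages rather than the full data frame. It compares the .bincode indices with the integer codes of cut, then builds the labels in vectorised form; same_codes, lower, and groups are names introduced here. Wrapping the result in factor with explicit levels is an optional extra, not part of the answer above: plain indexing returns a character vector, which sorts alphabetically rather than by age group.

```r
ages <- c(18, 27, 28, 47, 70)        # illustrative values only
mybreaks <- seq(17, 77, by = 10)

codes_bincode <- .bincode(ages, breaks = mybreaks)
codes_cut <- as.integer(cut(ages, breaks = mybreaks))
same_codes <- identical(codes_bincode, codes_cut)  # both use right-closed bins by default

n <- length(mybreaks)
lower <- mybreaks[-n] + c(0, rep(1, n - 2))  # first group keeps its raw lower break
format_groups <- sprintf("(%s-%s)", lower, mybreaks[-1])

# Keep the bin order by storing the labels as factor levels.
groups <- factor(format_groups[codes_bincode], levels = format_groups)
groups
```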