How to change the format of the output obtained by base R's cut?

I have a dataframe with ages and my objective is to split those ages into groups.

I found a way to do it using cut from base R. However, I would like to change the format of the output. I saw one approach in this post but, as someone points out there, it loses information contained in the original notation.

The code:

set.seed(1234)
Age <- floor(runif(25, min=18, max=75))

df <- data.frame(Age)
mybreaks <- seq(min(df$Age) - 1, to = max(df$Age) + 10, by = 10)
df$groups_age <- cut(df$Age, breaks = mybreaks, by=10)


label_interval <- function(breaks) {
  paste0("(", breaks[1:length(breaks) - 1], "-", breaks[2:length(breaks)], ")")
}
df$groups_age_2 <- cut(df$Age, breaks = mybreaks, labels = label_interval(mybreaks))


df[7:12, ]
#    Age groups_age groups_age_2
# 7   18    [17,27)      (17-27)
# 8   31    [27,37)      (27-37)
# 9   55    [47,57)      (47-57)
# 10  47    [47,57)      (37-47)
# 11  57    [57,67)      (47-57)
# 12  49    [47,57)      (47-57)

As you can see, the column "groups_age" is the output from cut and the column "groups_age_2" is the result of using the function label_interval (which someone wrote in the previous post).

It could be a good solution, but I lose information.

For example, rows 7 and 8 hold different ages that fall into different groups, yet both labels, (17-27) and (27-37), contain 27, which is not correct.

I would like row 8 to read (28-37), but I don't know how to build that into the function so that it applies to the entire dataframe.

Does anyone know how to do it?

CodePudding user response:

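Try redefining your function like this. It uses sprintf instead of paste0, which can be easier when hardcoded and dynamic strings are concatenated. Adding 1 to each lower break makes every label start at the first integer age that actually falls in the group.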

label_interval <- function(breaks) {
  do.call(\(...) sprintf('(%s-%s)', ...),
          cbind.data.frame(breaks[-length(breaks)] + 1, breaks[-1]))
}

df$groups_age_2 <- cut(df$Age, breaks=mybreaks, labels=label_interval(mybreaks))

df[order(df$Age), ]
#    Age groups_age groups_age_2
# 7   18    (17,27]      (18-27)
# 24  20    (17,27]      (18-27)
# 1   24    (17,27]      (18-27)
# 23  27    (17,27]      (18-27)
# 19  28    (27,37]      (28-37)
# 25  30    (27,37]      (28-37)
# 8   31    (27,37]      (28-37)
# 20  31    (27,37]      (28-37)
# 18  33    (27,37]      (28-37)
# 13  34    (27,37]      (28-37)
# 15  34    (27,37]      (28-37)
# 17  34    (27,37]      (28-37)
# 22  35    (27,37]      (28-37)
# 21  36    (27,37]      (28-37)
# 10  47    (37,47]      (38-47)
# 12  49    (47,57]      (48-57)
# 3   52    (47,57]      (48-57)
# 2   53    (47,57]      (48-57)
# 4   53    (47,57]      (48-57)
# 6   54    (47,57]      (48-57)
# 9   55    (47,57]      (48-57)
# 11  57    (47,57]      (48-57)
# 16  65    (57,67]      (58-67)
# 5   67    (57,67]      (58-67)
# 14  70    (67,77]      (68-77)

Data:

df <- structure(list(Age = c(24, 53, 52, 53, 67, 54, 18, 31, 55, 47, 
57, 49, 34, 70, 34, 65, 34, 33, 28, 31, 36, 35, 27, 20, 30), 
    groups_age = structure(c(1L, 4L, 4L, 4L, 5L, 4L, 1L, 2L, 
    4L, 3L, 4L, 4L, 2L, 6L, 2L, 5L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 
    1L, 2L), .Label = c("(17,27]", "(27,37]", "(37,47]", "(47,57]", 
    "(57,67]", "(67,77]"), class = "factor")), row.names = c(NA, 
-25L), class = "data.frame")
mybreaks <- seq(min(df$Age) - 1, to = max(df$Age) + 10, by = 10)
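
As a side note, the same labels can be built without do.call, because sprintf is already vectorized over its arguments. The following is only a minimal sketch: the helper name label_interval_vec and the small ages vector are made up for illustration, and it assumes integer ages and cut's default right-closed intervals.

```r
# Hypothetical helper (the name is an assumption, not part of the answer above).
# Builds "(lo-hi)" labels for integer data cut into right-closed intervals.
label_interval_vec <- function(breaks) {
  lower <- head(breaks, -1) + 1  # first integer strictly above each left break
  upper <- tail(breaks, -1)      # the right break belongs to the interval
  sprintf("(%s-%s)", lower, upper)
}

mybreaks <- seq(17, 77, by = 10)  # same breaks as produced from the data above
labels <- label_interval_vec(mybreaks)

# Illustrative ages only, chosen to sit on and next to the boundaries
ages <- c(18, 27, 28, 47, 57, 70)
groups <- cut(ages, breaks = mybreaks, labels = labels)
print(data.frame(ages, groups))
```

With this, 27 lands in (18-27) and 28 in (28-37), so no value appears in two labels.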

CodePudding user response:

cut calls .bincode, which returns the groups as integer codes: 1 for the lowest interval, 2 for the next, and so on. So you could format the labels manually and then map them to the group index, e.g.:

format_groups <- sapply(1:(length(mybreaks)-1), function(i){
  # to avoid adding 1 to the first cluster:
  if(i>1){
    b1 <- mybreaks[i] + 1
  }else{
    b1 <- mybreaks[i]
  } 
  sprintf("(%s-%s)", b1, mybreaks[i + 1])
})
df$group_index <- .bincode(df$Age, breaks = mybreaks)
df$groups_age_2 <- format_groups[df$group_index]
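
To check that this mapping lines up with what cut itself does, the integer codes from .bincode can be compared with the codes of the factor that cut returns. This is a rough sketch rather than part of the answer: the ages vector is illustrative, the labels are built in vectorized form instead of with sapply, and wrapping the result in a factor is an optional extra to keep the groups in ascending order.

```r
mybreaks <- seq(17, 77, by = 10)
ages <- c(18, 27, 28, 47, 57, 70)  # illustrative values only

# .bincode() uses right-closed intervals by default, as cut() does
idx <- .bincode(ages, breaks = mybreaks)
idx_cut <- as.integer(cut(ages, breaks = mybreaks))
same_codes <- identical(idx, idx_cut)

# Labels as in the answer: the first group keeps its lower break unchanged,
# every later group starts one above its lower break
n <- length(mybreaks)
lower <- mybreaks[-n] + c(0, rep(1, n - 2))
format_groups <- sprintf("(%s-%s)", lower, mybreaks[-1])

# Optional: a factor with explicit levels keeps the groups ordered
groups <- factor(format_groups[idx], levels = format_groups)
print(data.frame(ages, idx, groups))
```

Plain character labels sort alphabetically, so the factor step mainly matters if the groups are later tabulated or plotted.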