How to change the format of the output obtained by base R's cut?

Time:12-22

I have a dataframe with ages and my objective is to split those ages into groups.

I found a way to do it using cut from base R. However, I would like to change the format of the output. I saw one approach in this post but, as someone there points out, it loses information contained in the original interval notation.

The code:

set.seed(1234)
Age <- floor(runif(25, min=18, max=75))

df <- data.frame(Age)
mybreaks <- seq(min(df$Age) - 1, to = max(df$Age) + 10, by = 10)
df$groups_age <- cut(df$Age, breaks = mybreaks)


label_interval <- function(breaks) {
  paste0("(", breaks[-length(breaks)], "-", breaks[-1], ")")
}
df$groups_age_2 <- cut(df$Age, breaks = mybreaks, labels = label_interval(mybreaks))


df[7:12, ]
#    Age groups_age groups_age_2
# 7   18    (17,27]      (17-27)
# 8   31    (27,37]      (27-37)
# 9   55    (47,57]      (47-57)
# 10  47    (37,47]      (37-47)
# 11  57    (47,57]      (47-57)
# 12  49    (47,57]      (47-57)

As you can see, the column "groups_age" is the output from cut and the column "groups_age_2" is the result of using the function label_interval (which someone wrote in the previous post).

It could be a good solution, but I lose information.

For example, rows 7 and 8 have different ages that fall into different groups, yet both labels, (17-27) and (27-37), appear to contain 27, which is not correct.

I would like row 8 to read (28-37), but I don't know how to change the function so that it works for the entire dataframe.

Does anyone know how to do it?

CodePudding user response:

Try redefining your function like this. It uses sprintf instead of paste, which might be easier in cases where hardcoded and dynamic strings are concatenated.

label_interval <- function(breaks) {
  do.call(\(...) sprintf('(%s-%s)', ...),
          cbind.data.frame(breaks[-length(breaks)] + 1, breaks[-1]))
}

df$groups_age_2 <- cut(df$Age, breaks=mybreaks, labels=label_interval(mybreaks))

df[order(df$Age), ]
#    Age groups_age groups_age_2
# 7   18    (17,27]      (18-27)
# 24  20    (17,27]      (18-27)
# 1   24    (17,27]      (18-27)
# 23  27    (17,27]      (18-27)
# 19  28    (27,37]      (28-37)
# 25  30    (27,37]      (28-37)
# 8   31    (27,37]      (28-37)
# 20  31    (27,37]      (28-37)
# 18  33    (27,37]      (28-37)
# 13  34    (27,37]      (28-37)
# 15  34    (27,37]      (28-37)
# 17  34    (27,37]      (28-37)
# 22  35    (27,37]      (28-37)
# 21  36    (27,37]      (28-37)
# 10  47    (37,47]      (38-47)
# 12  49    (47,57]      (48-57)
# 3   52    (47,57]      (48-57)
# 2   53    (47,57]      (48-57)
# 4   53    (47,57]      (48-57)
# 6   54    (47,57]      (48-57)
# 9   55    (47,57]      (48-57)
# 11  57    (47,57]      (48-57)
# 16  65    (57,67]      (58-67)
# 5   67    (57,67]      (58-67)
# 14  70    (67,77]      (68-77)

Data:

df <- structure(list(Age = c(24, 53, 52, 53, 67, 54, 18, 31, 55, 47, 
57, 49, 34, 70, 34, 65, 34, 33, 28, 31, 36, 35, 27, 20, 30), 
    groups_age = structure(c(1L, 4L, 4L, 4L, 5L, 4L, 1L, 2L, 
    4L, 3L, 4L, 4L, 2L, 6L, 2L, 5L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 
    1L, 2L), .Label = c("(17,27]", "(27,37]", "(37,47]", "(47,57]", 
    "(57,67]", "(67,77]"), class = "factor")), row.names = c(NA, 
-25L), class = "data.frame")
mybreaks <- seq(min(df$Age) - 1, to = max(df$Age) + 10, by = 10)
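
Since sprintf is vectorised over its arguments, the same labels can probably be built without do.call. The following is only a sketch under two assumptions: the ages are integer-valued, and cut is used with its default right-closed intervals. The helper name label_interval2 and the small age vector are made up for illustration and are not part of the answer above.

```r
# Hypothetical helper (not from the answer above): builds "(lo-hi)" labels
# for the right-closed intervals that cut() produces by default.
# Assumes integer-valued data, so (17,27] can be written as 18-27.
label_interval2 <- function(breaks) {
  lower <- breaks[-length(breaks)] + 1  # smallest integer inside each bin
  upper <- breaks[-1]                   # right end point, included by cut()
  sprintf("(%s-%s)", lower, upper)
}

mybreaks <- seq(17, 77, by = 10)      # same breaks as in the question
age <- c(18, 27, 28, 47, 57, 70)      # illustrative values only
groups <- cut(age, breaks = mybreaks, labels = label_interval2(mybreaks))
as.character(groups)
# "(18-27)" "(18-27)" "(28-37)" "(38-47)" "(48-57)" "(68-77)"
```

Note that 27 and 28 now carry different, non-overlapping labels, which is the behaviour asked for in the question.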

CodePudding user response:

cut calls .bincode, which returns the groups as integers: 1 for the lowest interval, then 2, and so on. So you could format the labels manually and then map them to the group index, e.g.

format_groups <- sapply(1:(length(mybreaks) - 1), function(i) {
  # to avoid adding 1 to the first cluster:
  if (i > 1) {
    b1 <- mybreaks[i] + 1
  } else {
    b1 <- mybreaks[i]
  }
  sprintf("(%s-%s)", b1, mybreaks[i + 1])
})
df$group_index <- .bincode(df$Age, breaks = mybreaks)
df$groups_age_2 <- format_groups[df$group_index]
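
To see why indexing the labels by .bincode is equivalent to going through cut, the sketch below compares the two on a small made-up age vector (the values are illustrative, not the question's data). It relies on .bincode having the same defaults as cut, namely right = TRUE and include.lowest = FALSE.

```r
mybreaks <- seq(17, 77, by = 10)      # same breaks as in the question
age <- c(18, 27, 28, 47, 57, 70)      # illustrative values only

# Integer bin index for each age, with right-closed intervals by default
idx <- .bincode(age, breaks = mybreaks)

# The integer codes of the factor returned by cut() are the same indices
stopifnot(identical(idx, as.integer(cut(age, breaks = mybreaks))))
idx
# 1 1 2 3 4 6
```

Values outside the breaks come back as NA from both functions, so indexing a label vector with such an index also yields NA for those rows.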