Group persons based on year of birth in R

I have the following dataset

df<- data.frame(x1=c(1,5,7,8,2,2,3,4,5,10),
birthyear=c(1992,1994,1993,1992,1995,1999,2000,2001,2000, 1994))

I want to group persons into 3-year intervals, so that persons born in 1992-1994 are in group 1, those born in 1995-1997 are in group 2, and so on. My real dataset is far larger, with over 10000 entries. What is the most efficient way to do this?

CodePudding user response:

I would simply use cut with breaks defined with seq:

# right = FALSE gives left-closed intervals: [1992, 1995), [1995, 1998), ...
df$group <- cut(df$birthyear,
                breaks = seq(1992, 2022, 3),
                labels = FALSE,
                right = FALSE)
df

Output:

#>    x1 birthyear group
#> 1   1      1992     1
#> 2   5      1994     1
#> 3   7      1993     1
#> 4   8      1992     1
#> 5   2      1995     2
#> 6   2      1999     3
#> 7   3      2000     3
#> 8   4      2001     4
#> 9   5      2000     3
#> 10 10      1994     1

Created on 2022-05-03 by the reprex package (v2.0.1)
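The upper limit 2022 in the call above is hard-coded. As a minimal sketch, assuming the years in the real data may fall outside 1992-2021, the breaks can instead be derived from the data. The names `width` and `breaks` are introduced here only for illustration.

```r
df <- data.frame(x1 = c(1, 5, 7, 8, 2, 2, 3, 4, 5, 10),
                 birthyear = c(1992, 1994, 1993, 1992, 1995,
                               1999, 2000, 2001, 2000, 1994))

width <- 3  # interval length in years (illustrative name)

# Start at the earliest year and extend one interval past the latest year,
# so that the last break always lies above max(birthyear).
breaks <- seq(min(df$birthyear), max(df$birthyear) + width, by = width)

# right = FALSE makes the intervals left-closed: [1992, 1995), [1995, 1998), ...
df$group <- cut(df$birthyear, breaks = breaks, labels = FALSE, right = FALSE)
df$group
```

With this variant, group 1 always starts at the earliest birth year present in the data rather than at a fixed year.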

CodePudding user response:

Here is a rather manual approach using case_when, where you define the span of years for each group. When using case_when, you define a condition, e.g. birthyear > 1991 & birthyear < 1995, and the outcome using a tilde ~, e.g. ~ 1.

library(dplyr)

df<- data.frame(x1=c(1,5,7,8,2,2,3,4,5,10),
                birthyear=c(1992,1994,1993,1992,1995,1999,2000,2001,2000, 1994))

df %>% 
  mutate(group = case_when(
    birthyear > 1991 & birthyear < 1995 ~ 1,  # 1992-1994
    birthyear > 1994 & birthyear < 1998 ~ 2,  # 1995-1997
    birthyear > 1997 & birthyear < 2001 ~ 3,  # 1998-2000
    birthyear > 2000 & birthyear < 2004 ~ 4   # 2001-2003
  ))

#>    x1 birthyear group
#> 1   1      1992     1
#> 2   5      1994     1
#> 3   7      1993     1
#> 4   8      1992     1
#> 5   2      1995     2
#> 6   2      1999     3
#> 7   3      2000     3
#> 8   4      2001     4
#> 9   5      2000     3
#> 10 10      1994     1

Created on 2022-05-03 by the reprex package (v0.3.0)
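As a possible variation, the 3-year ranges from the question can be written with dplyr's between(), which includes both endpoints and so avoids off-by-one mistakes with > and <. This is only a sketch. The final TRUE ~ NA_real_ line is an added assumption about how years outside the listed ranges should be handled, and `grouped` is an illustrative name.

```r
library(dplyr)

df <- data.frame(x1 = c(1, 5, 7, 8, 2, 2, 3, 4, 5, 10),
                 birthyear = c(1992, 1994, 1993, 1992, 1995,
                               1999, 2000, 2001, 2000, 1994))

grouped <- df %>%
  mutate(group = case_when(
    between(birthyear, 1992, 1994) ~ 1,
    between(birthyear, 1995, 1997) ~ 2,
    between(birthyear, 1998, 2000) ~ 3,
    between(birthyear, 2001, 2003) ~ 4,
    TRUE ~ NA_real_  # assumption: years outside all ranges become NA
  ))

grouped$group
```

Every range still has to be typed by hand, so this stays practical only for a small number of groups.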

CodePudding user response:

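Using integer division %/% might be an efficient way. Subtracting 1989 (three years before the first birth year, 1992) shifts the years so that 1992-1994 fall into group 1.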

df$group <- (df$birthyear - 1989L) %/% 3L
df
#   x1 birthyear group
#1   1      1992     1
#2   5      1994     1
#3   7      1993     1
#4   8      1992     1
#5   2      1995     2
#6   2      1999     3
#7   3      2000     3
#8   4      2001     4
#9   5      2000     3
#10 10      1994     1

To start from the lowest birthyear:

(df$birthyear - min(df$birthyear) + 3L) %/% 3L
# [1] 1 1 1 1 2 3 3 4 3 1
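The integer-division idea can be wrapped in a small function so that the interval width and the starting year become parameters. This is a sketch only: `group_years`, `width` and `start` are hypothetical names that do not appear in the answer.

```r
# Hypothetical helper: returns 1-based group numbers for intervals of
# `width` years, counted from `start` (by default the earliest year).
group_years <- function(years, width = 3L, start = min(years, na.rm = TRUE)) {
  (years - start) %/% width + 1L
}

birthyear <- c(1992, 1994, 1993, 1992, 1995, 1999, 2000, 2001, 2000, 1994)

g3 <- group_years(birthyear)               # 3-year intervals from 1992
g5 <- group_years(birthyear, width = 5L)   # 5-year intervals from 1992
g3
g5
```

Adding 1 after the division is equivalent to adding the width before it, so for `width = 3L` this reproduces the expression above.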

Benchmark:

bench::mark(
         "cut" = cut(df$birthyear, seq(1992, 2022, 3), labels = F, right = F),
         "%/%min" = (df$birthyear - min(df$birthyear) + 3L) %/% 3L,
         "%/%" = (df$birthyear - 1989L) %/% 3L
         )
#  expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#1 cut        73.92µs 79.17µs    11627.        0B     12.3  5679     6    488.4ms
#2 %/%min      2.85µs  3.09µs   314123.        0B     31.4  9999     1     31.8ms
#3 %/%         1.64µs  1.81µs   517164.        0B      0   10000     0     19.3ms

In this example, integer division %/% is about 40 times faster than cut (median 1.81µs versus 79.17µs).
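The benchmark above runs on the 10-row example. Before timing a dataset of the size mentioned in the question, it may be worth checking that both approaches assign the same groups. The sketch below does that on simulated birth years. The simulated data, the seed and the names `by_cut`, `by_div` and `same_groups` are assumptions for illustration, not the asker's data.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# 10000 simulated birth years within the range covered by the cut() breaks
birthyear <- sample(1992:2021, 10000, replace = TRUE)

by_cut <- cut(birthyear, seq(1992, 2022, 3), labels = FALSE, right = FALSE)
by_div <- (birthyear - 1989L) %/% 3L

# TRUE if both approaches assign every person to the same group
same_groups <- all(by_cut == by_div)
same_groups
```

The two vectors can then be passed to bench::mark() in the same way as above to see how the timings compare at this size.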