I have the following dataset:
df<- data.frame(x1=c(1,5,7,8,2,2,3,4,5,10),
birthyear=c(1992,1994,1993,1992,1995,1999,2000,2001,2000, 1994))
I want to group people into 3-year intervals, so that people born in 1992-1994 are in group 1, those born in 1995-1997 are in group 2, and so on. My real dataset is far larger, with over 10,000 entries. What is the most efficient way to do this?
CodePudding user response:
I would simply use cut with breaks defined with seq:
# right = FALSE makes the intervals left-closed: [1992, 1995), [1995, 1998), ...
df$group <- cut(df$birthyear,
                breaks = seq(1992, 2022, 3),
                labels = FALSE,
                right = FALSE)
df
Output:
#> x1 birthyear group
#> 1 1 1992 1
#> 2 5 1994 1
#> 3 7 1993 1
#> 4 8 1992 1
#> 5 2 1995 2
#> 6 2 1999 3
#> 7 3 2000 3
#> 8 4 2001 4
#> 9 5 2000 3
#> 10 10 1994 1
Created on 2022-05-03 by the reprex package (v2.0.1)
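The breaks above are hard-coded from 1992 to 2022. As a minimal sketch, assuming the first interval should start at the earliest birth year in the data, the breaks could instead be derived from the data. The names width, start and breaks are illustrative and not part of the original answer.

```r
df <- data.frame(x1 = c(1, 5, 7, 8, 2, 2, 3, 4, 5, 10),
                 birthyear = c(1992, 1994, 1993, 1992, 1995, 1999, 2000, 2001, 2000, 1994))

width <- 3                   # interval length in years (assumption)
start <- min(df$birthyear)   # first interval starts at the earliest year (assumption)

# Extend one full interval past the maximum so the latest year always falls in a bin.
breaks <- seq(start, max(df$birthyear) + width, by = width)

# Left-closed intervals: [1992, 1995), [1995, 1998), ...
df$group <- cut(df$birthyear, breaks = breaks, labels = FALSE, right = FALSE)
df$group
# [1] 1 1 1 1 2 3 3 4 3 1
```

With this variant the group numbers depend on the minimum year present in the data, so it only matches the fixed 1992 start when someone born in 1992 is in the dataset.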
CodePudding user response:
Here is a rather manual approach using case_when, where you define the span of years for each group. With case_when, you write a condition, e.g. birthyear > 1991 & birthyear < 1995, followed by a tilde ~ and the outcome, e.g. ~ 1.
library(dplyr)
df<- data.frame(x1=c(1,5,7,8,2,2,3,4,5,10),
birthyear=c(1992,1994,1993,1992,1995,1999,2000,2001,2000, 1994))
df %>%
mutate(group = case_when(
birthyear > 1991 & birthyear < 1995 ~ 1,
birthyear > 1994 & birthyear < 1998 ~ 2,
birthyear > 1997 & birthyear < 2001 ~ 3,
birthyear > 2000 & birthyear < 2004 ~ 4
))
#> x1 birthyear group
#> 1 1 1992 1
#> 2 5 1994 1
#> 3 7 1993 1
#> 4 8 1992 1
#> 5 2 1995 2
#> 6 2 1999 3
#> 7 3 2000 3
#> 8 4 2001 4
#> 9 5 2000 3
#> 10 10 1994 1
Created on 2022-05-03 by the reprex package (v0.3.0)
CodePudding user response:
Using integer division %/% might be an efficient way.
df$group <- (df$birthyear - 1989L) %/% 3L
df
# x1 birthyear group
#1 1 1992 1
#2 5 1994 1
#3 7 1993 1
#4 8 1992 1
#5 2 1995 2
#6 2 1999 3
#7 3 2000 3
#8 4 2001 4
#9 5 2000 3
#10 10 1994 1
To start from the lowest birthyear:
(df$birthyear - min(df$birthyear) + 3L) %/% 3L
# [1] 1 1 1 1 2 3 3 4 3 1
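The same arithmetic can be wrapped in a small helper so the interval width and the starting year are explicit. This is only a sketch: year_group, width and start are hypothetical names, not part of the original answer.

```r
# Hypothetical helper: assign 1-based group numbers to years in fixed-width intervals.
# `start` defaults to the earliest year present in the data.
year_group <- function(year, width = 3L, start = min(year, na.rm = TRUE)) {
  # Years in [start, start + width) get 1, the next interval gets 2, and so on.
  (year - start) %/% width + 1L
}

birthyear <- c(1992, 1994, 1993, 1992, 1995, 1999, 2000, 2001, 2000, 1994)

year_group(birthyear)
# [1] 1 1 1 1 2 3 3 4 3 1

# Pinning the start keeps group numbers stable even if the earliest year is missing.
year_group(c(1995, 2001), start = 1992)
# [1] 2 4
```

Adding 1 after the division is equivalent to adding the width before it, as in the expression above.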
Benchmark:
bench::mark(
"cut" = cut(df$birthyear, seq(1992, 2022, 3), labels = FALSE, right = FALSE),
"%/%min" = (df$birthyear - min(df$birthyear) + 3L) %/% 3L,
"%/%" = (df$birthyear - 1989L) %/% 3L
)
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
# <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
#1 cut 73.92µs 79.17µs 11627. 0B 12.3 5679 6 488.4ms
#2 %/%min 2.85µs 3.09µs 314123. 0B 31.4 9999 1 31.8ms
#3 %/% 1.64µs 1.81µs 517164. 0B 0 10000 0 19.3ms
In this example, integer division with %/% is about 40 times faster than cut (median 1.81µs versus 79.17µs).
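Since the question mentions more than 10,000 rows, it may be worth checking that both approaches agree on data of that size before relying on the faster one. The following is a sketch on simulated birth years; the seed, the sample size and the 1992-2021 range are assumptions made for illustration, and no timings are claimed.

```r
set.seed(1)   # arbitrary seed, only for reproducibility
n <- 10000    # assumed size, roughly matching the question

# Simulated birth years covered by the breaks seq(1992, 2022, 3).
birthyear <- sample(1992:2021, n, replace = TRUE)

by_cut <- cut(birthyear, seq(1992, 2022, 3), labels = FALSE, right = FALSE)
by_div <- (birthyear - 1989L) %/% 3L

# Both approaches should assign every row to the same group.
stopifnot(all(by_cut == by_div))
range(by_div)
# [1]  1 10
```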