I have this data:
data <- data.frame(id_pers=c(4102,13102,27101,27102,28101,28102, 42101,42102,56102,73102,74103,103104,117103,117104,117105),
birthyear=c(1992,1994,1993,1992,1995,1999,2000,2001,2000, 1994, 1999, 1978, 1986, 1998, 1999))
I want to group the different persons by familys in a new column, so that persons 27101,27102 (siblings) are group/family 1 and 42101,42102 are in group 2, 117103,117104,117105 are in group 3 so on. Person "4102" has no siblings and should be a NA in the new column. It is always the case that 2 or more persons are siblings if the ID's are not further apart than a maximum of 6 numbers. I have a far larger dataset with over 3000 rows. How could I do it the most efficient way?
CodePudding user response:
You can use round
with digits = -1
:
data %>%
group_by(fam_id = round(id_pers - 5, digits = -1))
output
# A tibble: 15 × 3
# Groups: fam_id [10]
id_pers birthyear fam_id
<dbl> <dbl> <dbl>
1 4102 1992 4100
2 13102 1994 13100
3 27101 1993 27100
4 27102 1992 27100
5 28101 1995 28100
6 28106 1999 28100
7 42101 2000 42100
8 42102 2001 42100
9 56102 2000 56100
10 73102 1994 73100
11 74103 1999 74100
12 103104 1978 103100
13 117103 1986 117100
14 117104 1998 117100
15 117105 1999 117100
CodePudding user response:
It looks like we can the 1000s digit (and above) to delineate groups.
library(dplyr)
data %>%
mutate(
famgroup = trunc(id_pers/1000),
famgroup = match(famgroup, unique(famgroup))
)
# id_pers birthyear famgroup
# 1 4102 1992 1
# 2 13102 1994 2
# 3 27101 1993 3
# 4 27102 1992 3
# 5 28101 1995 4
# 6 28102 1999 4
# 7 42101 2000 5
# 8 42102 2001 5
# 9 56102 2000 6
# 10 73102 1994 7
# 11 74103 1999 8
# 12 103104 1978 9
# 13 117103 1986 10
# 14 117104 1998 10
# 15 117105 1999 10