Is it possible to make groups based on an ID of a person in R?-CodePudding

I have this data:

data <- data.frame(id_pers=c(4102,13102,27101,27102,28101,28102, 42101,42102,56102,73102,74103,103104,117103,117104,117105),
                   birthyear=c(1992,1994,1993,1992,1995,1999,2000,2001,2000, 1994, 1999, 1978, 1986, 1998, 1999))

I want to group the different persons by familys in a new column, so that persons 27101,27102 (siblings) are group/family 1 and 42101,42102 are in group 2, 117103,117104,117105 are in group 3 so on. Person "4102" has no siblings and should be a NA in the new column. It is always the case that 2 or more persons are siblings if the ID's are not further apart than a maximum of 6 numbers. I have a far larger dataset with over 3000 rows. How could I do it the most efficient way?

CodePudding user response：

You can use round with digits = -1:

data %>% 
  group_by(fam_id = round(id_pers - 5, digits = -1))

output

# A tibble: 15 × 3
# Groups:   fam_id [10]
   id_pers birthyear fam_id
     <dbl>     <dbl>  <dbl>
 1    4102      1992   4100
 2   13102      1994  13100
 3   27101      1993  27100
 4   27102      1992  27100
 5   28101      1995  28100
 6   28106      1999  28100
 7   42101      2000  42100
 8   42102      2001  42100
 9   56102      2000  56100
10   73102      1994  73100
11   74103      1999  74100
12  103104      1978 103100
13  117103      1986 117100
14  117104      1998 117100
15  117105      1999 117100

CodePudding user response：

It looks like we can the 1000s digit (and above) to delineate groups.

library(dplyr)
data %>%
  mutate(
    famgroup = trunc(id_pers/1000),
    famgroup = match(famgroup, unique(famgroup))
  )
#    id_pers birthyear famgroup
# 1     4102      1992        1
# 2    13102      1994        2
# 3    27101      1993        3
# 4    27102      1992        3
# 5    28101      1995        4
# 6    28102      1999        4
# 7    42101      2000        5
# 8    42102      2001        5
# 9    56102      2000        6
# 10   73102      1994        7
# 11   74103      1999        8
# 12  103104      1978        9
# 13  117103      1986       10
# 14  117104      1998       10
# 15  117105      1999       10