Most efficient way of determining which ID does not have a pair?

Time:12-15

Say that I have a dataframe that looks like the one below. It contains the following pairs of IDs: (4330, 4331), (2333, 2334), and (3336, 3337), which are within +/- 1 of each other. However, 3349 does not have a pair. What would be the most efficient way of filtering out unpaired IDs?

    ID sex zyg race SES
1 4330   2   2    2   1
2 4331   2   2    2   1
3 2333   2   2    1  78
4 2334   2   2    1  78
5 3336   2   2    1  18
6 3337   2   2    1  18
7 3349   2   2    1  18

CodePudding user response:

This will return only pairs/twins (no unpaired or triplets, quadruplets, etc.). In base R:

df <- data.frame(ID = c(1:3, 4330, 4331, 2333, 2334, 3336, 3337, 3349), sex = 2)
df <- df[order(df$ID),]
df[
  rep(
    with(
      rle(diff(df$ID)),
      cumsum(lengths)[lengths == 1L & values == 1]
    ), each = 2
  ) + 0:1,
]
#>     ID sex
#> 6 2333   2
#> 7 2334   2
#> 8 3336   2
#> 9 3337   2
#> 4 4330   2
#> 5 4331   2

Explanation:

After sorting the data, only individuals in a group (a twin, triplet, etc.) will have an ID difference of 1 from the individual in the next row. diff(df$ID) returns the difference in ID value from one row to the next along the whole data.frame. To identify twins, we want to find where diff(df$ID) has a 1 that is by itself (i.e., neither the previous value nor the next value is also 1). We use rle to find those lone 1s:

rle(diff(df$ID))
#> Run Length Encoding
#>   lengths: int [1:8] 2 1 1 1 1 1 1 1
#>   values : num [1:8] 1 2330 1 1002 1 12 981 1

Lone 1s occur where the value of diff(df$ID) (values) and the length of the run (lengths) are both 1. This is the case for the third, fifth, and eighth runs. cumsum(lengths) gives the position in diff(df$ID) where each run ends; for a run of length 1 that is also where it starts, and position i in diff(df$ID) corresponds to rows i and i + 1 of df. Subsetting cumsum(lengths) at runs 3, 5, and 8 therefore gives the first row of each twin pair in df (rows 4, 6, and 9). We repeat each of those indices twice with rep(..., each = 2), then add 0:1 (taking advantage of recycling in R) to get the row indices of every individual who is a twin.
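The one-liner can also be unpacked one step at a time, which makes each intermediate result easy to inspect. This is only a sketch on the same toy data as this answer; the intermediate names (d, r, run_end, lone, first_row, twin_rows) are made up for illustration.

```r
# Same toy data as above: a triplet (1:3), three twin pairs, and one singleton.
df <- data.frame(ID = c(1:3, 4330, 4331, 2333, 2334, 3336, 3337, 3349), sex = 2)
df <- df[order(df$ID), ]

d <- diff(df$ID)                 # gap between each row and the next row
r <- rle(d)                      # runs of identical gaps
run_end <- cumsum(r$lengths)     # position in d where each run ends
lone <- r$lengths == 1L & r$values == 1   # runs made of a single gap of 1

# For a run of length 1 the end position is also the start position,
# and position i in d refers to rows i and i + 1 of df.
first_row <- run_end[lone]

# Each pair is its first row plus the row right after it.
twin_rows <- rep(first_row, each = 2) + 0:1
twins <- df[twin_rows, ]
```

Printing first_row and twin_rows before the final subset is a quick way to check that triplets and singletons were skipped.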

CodePudding user response:

Using dplyr::lag() and dplyr::lead(), you can filter() to rows where the previous ID is ID - 1 or the next ID is ID + 1:

library(dplyr)

df %>% 
  filter(lag(ID) == ID - 1 | lead(ID) == ID + 1)
# A tibble: 6 × 5
     ID   sex   zyg  race   SES
  <dbl> <dbl> <dbl> <dbl> <dbl>
1  4330     2     2     2     1
2  4331     2     2     2     1
3  2333     2     2     1    78
4  2334     2     2     1    78
5  3336     2     2     1    18
6  3337     2     2     1    18

Edit: this will not filter out "triplets," "quadruplets," etc., which the comments mention as an additional requirement.
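If exact pairs are required, one possible variation is to group consecutive IDs into runs and keep only runs of size 2. The sketch below uses base R so that it needs no packages. The helper name keep_exact_pairs is hypothetical, the data frame is retyped from the question, and the code assumes a non-empty data frame with no missing IDs.

```r
# Data retyped from the question.
df <- data.frame(
  ID   = c(4330, 4331, 2333, 2334, 3336, 3337, 3349),
  sex  = 2,
  zyg  = 2,
  race = c(2, 2, 1, 1, 1, 1, 1),
  SES  = c(1, 1, 78, 78, 18, 18, 18)
)

# Hypothetical helper: keep rows whose ID belongs to a run of exactly two
# consecutive IDs. Assumes at least one row and no NA in the ID column.
keep_exact_pairs <- function(data, id = "ID") {
  data <- data[order(data[[id]]), , drop = FALSE]
  # Start a new group whenever the gap to the previous ID is not exactly 1.
  grp <- cumsum(c(TRUE, diff(data[[id]]) != 1))
  # Size of the group that each row belongs to.
  grp_size <- ave(data[[id]], grp, FUN = length)
  data[grp_size == 2, , drop = FALSE]
}

pairs_only <- keep_exact_pairs(df)
```

Unlike the lag()/lead() filter, this sorts by ID first, so it does not rely on the members of a pair already sitting in adjacent rows. The same grouping variable could presumably be passed to dplyr's group_by() and combined with filter(n() == 2).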
