Say that I have a dataframe that looks like the one below. It contains the following pairs of IDs: (4330, 4331), (2333, 2334), (3336, 3337), which are within +/- 1 of each other. However, 3349 does not have a pair. What would be the most efficient way of filtering out unpaired IDs?
ID sex zyg race SES
1 4330 2 2 2 1
2 4331 2 2 2 1
3 2333 2 2 1 78
4 2334 2 2 1 78
5 3336 2 2 1 18
6 3337 2 2 1 18
7 3349 2 2 1 18
CodePudding user response:
This will return only pairs/twins (no unpaired or triplets, quadruplets, etc.). In base R:
df <- data.frame(ID = c(1:3, 4330, 4331, 2333, 2334, 3336, 3337, 3349), sex = 2)
df <- df[order(df$ID), ]
df[
  rep(
    with(
      rle(diff(df$ID)),
      cumsum(lengths)[lengths == 1L & values == 1]
    ), each = 2
  ) + 0:1,
]
#> ID sex
#> 6 2333 2
#> 7 2334 2
#> 8 3336 2
#> 9 3337 2
#> 4 4330 2
#> 5 4331 2
Explanation:
After sorting the data, an ID difference of 1 between one row and the next occurs only within a group (twins, triplets, etc.). diff(df$ID) returns the difference in ID value from one row to the next along the whole data.frame. To identify twins, we want to find where diff(df$ID) has a 1 that is by itself (i.e., neither the previous value nor the next value is also 1). We use rle to find those lone 1s:
rle(diff(df$ID))
#> Run Length Encoding
#> lengths: int [1:8] 2 1 1 1 1 1 1 1
#> values : num [1:8] 1 2330 1 1002 1 12 981 1
Lone 1s occur where the value of diff(df$ID) (values) and the length of the run of that value (lengths) are both 1. This occurs with the third, fifth, and eighth runs. The position at which each run ends within diff(df$ID) is given by cumsum(lengths); for a run of length 1 this is also the row of df where the pair starts, so we subset cumsum(lengths) at the third, fifth, and eighth runs to get the starting row (4, 6, and 9) of each twin pair in df. We repeat each of those indices twice with rep(..., each = 2), then add 0:1 (taking advantage of recycling in R) to get the indices of every individual who is a twin.
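The steps above can be wrapped in a small helper and run on the data from the question. This is only a sketch: the function name keep_twins and its id argument are hypothetical (not part of the answer), and it assumes IDs are unique, numeric, and non-missing.

```r
# Hypothetical helper (not from the original answer): keep only rows whose
# ID belongs to a run of exactly two consecutive integers.
keep_twins <- function(data, id = "ID") {
  # Sort so that members of the same group sit in adjacent rows
  data <- data[order(data[[id]]), , drop = FALSE]
  runs <- rle(diff(data[[id]]))
  # Position where each run ends within the diff vector; for a run of
  # length 1 this is also the row where the pair starts.
  run_end <- cumsum(runs$lengths)
  first_of_pair <- run_end[runs$lengths == 1L & runs$values == 1]
  # Each pair occupies the starting row and the row after it
  data[rep(first_of_pair, each = 2) + 0:1, , drop = FALSE]
}

# Data from the question
df <- data.frame(
  ID   = c(4330, 4331, 2333, 2334, 3336, 3337, 3349),
  sex  = 2,
  zyg  = 2,
  race = c(2, 2, 1, 1, 1, 1, 1),
  SES  = c(1, 1, 78, 78, 18, 18, 18)
)

twins <- keep_twins(df)
twins$ID
```

Note that the result comes back sorted by ID rather than in the original row order.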
CodePudding user response:
Using dplyr::lag() and lead(), you can filter() to rows where the previous ID is ID - 1 or the next ID is ID + 1:
library(dplyr)
df %>%
  filter(lag(ID) == ID - 1 | lead(ID) == ID + 1)
# A tibble: 6 × 5
ID sex zyg race SES
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4330 2 2 2 1
2 4331 2 2 2 1
3 2333 2 2 1 78
4 2334 2 2 1 78
5 3336 2 2 1 18
6 3337 2 2 1 18
Edit: this will not filter out "triplets," "quadruplets," etc., contrary to the additional requirement mentioned in the comments. It also relies on paired IDs sitting in adjacent rows, as they do in the example.
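If triplets and larger groups also need to be dropped, one possible dplyr variant is to label each run of consecutive IDs and keep only runs of exactly two rows. This is a sketch rather than part of the answer above: the helper column run is a made-up name, the example data is borrowed from the base R answer, and the code assumes unique IDs and at least one row.

```r
library(dplyr)

# Example data from the base R answer: 1, 2, 3 form a "triplet"
df <- data.frame(ID = c(1:3, 4330, 4331, 2333, 2334, 3336, 3337, 3349), sex = 2)

pairs_only <- df %>%
  arrange(ID) %>%
  # Start a new run whenever the gap to the previous ID is not exactly 1
  # ("run" is a temporary, hypothetical column name)
  mutate(run = cumsum(c(TRUE, diff(ID) != 1))) %>%
  group_by(run) %>%
  # Keep only runs made of exactly two consecutive IDs
  filter(n() == 2) %>%
  ungroup() %>%
  select(-run)

pairs_only$ID
```

Because of arrange(), the rows come back sorted by ID, so this version does not depend on pairs already being adjacent in the input.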