Home > Net >  Retaining row pairs and excluding singlets and triplets
Retaining row pairs and excluding singlets and triplets

Time:01-31

Say that I have a data frame like this....

df <- data.frame(ID = c("2280", "2280","2280","2280","3115","2281", "2281","2281","2281", "3282","3282","3282","3282", "3283","3283","3283","3283","1821","1822", "4007", "1145", "1145", "1146", "1147"), sib_ID = c("2282", "2282", "2282", "2282", "3117", "2282", "2282", "2282", "2282", "3284", "3284", "3284","3284", "3284", "3284", "3284", "3284", "1823", "1823","4009", "1148", "1148","1148", "1148"), Age = c("3", "12", "6", "9", "3","9", "6", "12","3","9", "6", "12","3","9", "6", "12", "6", "12","12", "6", "12","12", "6", "6"), Behavior = c("good", "bad", "good", "bad", "good", "good", "good", "bad", "good", "good", "good", "bad","good", "good", "good", "bad", "good", "bad", "good","good", "bad", "good","good", "good"))

> df
     ID sib_ID Age Behavior
1  2280   2282   3     good
2  2280   2282  12      bad
3  2280   2282   6     good
4  2280   2282   9      bad
5  3115   3117   3     good
6  2281   2282   9     good
7  2281   2282   6     good
8  2281   2282  12      bad
9  2281   2282   3     good
10 3282   3284   9     good
11 3282   3284   6     good
12 3282   3284  12      bad
13 3282   3284   3     good
14 3283   3284   9     good
15 3283   3284   6     good
16 3283   3284  12      bad
17 3283   3284   6     good
18 1821   1823  12      bad
19 1822   1823  12     good
20 4007   4009   6     good
21 1145   1148  12      bad
22 1145   1148  12     good
23 1146   1148   6     good
24 1147   1148   6     good

and I want my data frame to only consist of IDs that have a pair. So we would keep data from (2280, 2281), (3282, 3283), (1821, 1822) and remove cases where the ID does not have a pair (3115 and 4007) and cases where we have a triplet (1145, 1146, 1147). What would be the most efficient way to go about doing this?

Example of desired output:

> df
     ID sib_ID Age Behavior
1  2280   2282   3     good
2  2280   2282  12      bad
3  2280   2282   6     good
4  2280   2282   9      bad
6  2281   2282   9     good
7  2281   2282   6     good
8  2281   2282  12      bad
9  2281   2282   3     good
10 3282   3284   9     good
11 3282   3284   6     good
12 3282   3284  12      bad
13 3282   3284   3     good
14 3283   3284   9     good
15 3283   3284   6     good
16 3283   3284  12      bad
17 3283   3284   6     good
18 1821   1823  12      bad
19 1822   1823  12     good

CodePudding user response:

First, I think we need to clarify and solidify the grouping of IDs. I suggest we create a frame up-front to clearly identify these groups, then join back into the original data.

GRPs <- data.frame(ID = as.integer(sort(unique(df$ID)))) %>%
  mutate(GRP = c(0, cumsum(diff(ID) > 1)), ID = as.character(ID))
GRPs
#      ID GRP
# 1  1145   0
# 2  1146   0
# 3  1147   0
# 4  1821   1
# 5  1822   1
# 6  2280   2
# 7  2281   2
# 8  3115   3
# 9  3282   4
# 10 3283   4
# 11 4007   5

From here, we join them back in and then do the grouped determination of "complete" or not.

left_join(df, GRPs, by = "ID") %>%
  group_by(ID) %>%
  mutate(keep1 = all(c("good", "bad") %in% Behavior)) %>%
  group_by(GRP) %>%
  mutate(keep2 = all(c("good", "bad") %in% Behavior)) %>%
  ungroup() %>%
  dplyr::filter(keep1 | keep2)
# # A tibble: 22 × 6
#    ID    Age   Behavior   GRP keep1 keep2
#    <chr> <chr> <chr>    <dbl> <lgl> <lgl>
#  1 2280  3     good         2 TRUE  TRUE 
#  2 2280  12    bad          2 TRUE  TRUE 
#  3 2280  6     good         2 TRUE  TRUE 
#  4 2280  9     bad          2 TRUE  TRUE 
#  5 2281  9     good         2 TRUE  TRUE 
#  6 2281  6     good         2 TRUE  TRUE 
#  7 2281  12    bad          2 TRUE  TRUE 
#  8 2281  3     good         2 TRUE  TRUE 
#  9 3282  9     good         4 TRUE  TRUE 
# 10 3282  6     good         4 TRUE  TRUE 
# # … with 12 more rows
# # ℹ Use `print(n = ...)` to see more rows

Though it's returning more rows ...

  • Related