Say that I have a data frame like this....
df <- data.frame(ID = c("2280", "2280","2280","2280","3115","2281", "2281","2281","2281", "3282","3282","3282","3282", "3283","3283","3283","3283","1821","1822", "4007", "1145", "1145", "1146", "1147"), sib_ID = c("2282", "2282", "2282", "2282", "3117", "2282", "2282", "2282", "2282", "3284", "3284", "3284","3284", "3284", "3284", "3284", "3284", "1823", "1823","4009", "1148", "1148","1148", "1148"), Age = c("3", "12", "6", "9", "3","9", "6", "12","3","9", "6", "12","3","9", "6", "12", "6", "12","12", "6", "12","12", "6", "6"), Behavior = c("good", "bad", "good", "bad", "good", "good", "good", "bad", "good", "good", "good", "bad","good", "good", "good", "bad", "good", "bad", "good","good", "bad", "good","good", "good"))
> df
ID sib_ID Age Behavior
1 2280 2282 3 good
2 2280 2282 12 bad
3 2280 2282 6 good
4 2280 2282 9 bad
5 3115 3117 3 good
6 2281 2282 9 good
7 2281 2282 6 good
8 2281 2282 12 bad
9 2281 2282 3 good
10 3282 3284 9 good
11 3282 3284 6 good
12 3282 3284 12 bad
13 3282 3284 3 good
14 3283 3284 9 good
15 3283 3284 6 good
16 3283 3284 12 bad
17 3283 3284 6 good
18 1821 1823 12 bad
19 1822 1823 12 good
20 4007 4009 6 good
21 1145 1148 12 bad
22 1145 1148 12 good
23 1146 1148 6 good
24 1147 1148 6 good
and I want my data frame to only consist of IDs that have a pair. So we would keep data from (2280, 2281), (3282, 3283), (1821, 1822) and remove cases where the ID does not have a pair (3115 and 4007) and cases where we have a triplet (1145, 1146, 1147). What would be the most efficient way to go about doing this?
Example of desired output:
> df
ID sib_ID Age Behavior
1 2280 2282 3 good
2 2280 2282 12 bad
3 2280 2282 6 good
4 2280 2282 9 bad
6 2281 2282 9 good
7 2281 2282 6 good
8 2281 2282 12 bad
9 2281 2282 3 good
10 3282 3284 9 good
11 3282 3284 6 good
12 3282 3284 12 bad
13 3282 3284 3 good
14 3283 3284 9 good
15 3283 3284 6 good
16 3283 3284 12 bad
17 3283 3284 6 good
18 1821 1823 12 bad
19 1822 1823 12 good
CodePudding user response:
First, I think we need to clarify and solidify the grouping of IDs. I suggest we create a frame up-front to clearly identify these groups, then join back into the original data.
GRPs <- data.frame(ID = as.integer(sort(unique(df$ID)))) %>%
mutate(GRP = c(0, cumsum(diff(ID) > 1)), ID = as.character(ID))
GRPs
# ID GRP
# 1 1145 0
# 2 1146 0
# 3 1147 0
# 4 1821 1
# 5 1822 1
# 6 2280 2
# 7 2281 2
# 8 3115 3
# 9 3282 4
# 10 3283 4
# 11 4007 5
From here, we join them back in and then do the grouped determination of "complete" or not.
left_join(df, GRPs, by = "ID") %>%
group_by(ID) %>%
mutate(keep1 = all(c("good", "bad") %in% Behavior)) %>%
group_by(GRP) %>%
mutate(keep2 = all(c("good", "bad") %in% Behavior)) %>%
ungroup() %>%
dplyr::filter(keep1 | keep2)
# # A tibble: 22 × 6
# ID Age Behavior GRP keep1 keep2
# <chr> <chr> <chr> <dbl> <lgl> <lgl>
# 1 2280 3 good 2 TRUE TRUE
# 2 2280 12 bad 2 TRUE TRUE
# 3 2280 6 good 2 TRUE TRUE
# 4 2280 9 bad 2 TRUE TRUE
# 5 2281 9 good 2 TRUE TRUE
# 6 2281 6 good 2 TRUE TRUE
# 7 2281 12 bad 2 TRUE TRUE
# 8 2281 3 good 2 TRUE TRUE
# 9 3282 9 good 4 TRUE TRUE
# 10 3282 6 good 4 TRUE TRUE
# # … with 12 more rows
# # ℹ Use `print(n = ...)` to see more rows
Though it's returning more rows ...