I have a data frame (df) like this one:
ID matching_variable status
1 1 case
2 1 control
3 2 case
4 2 case
5 3 control
6 3 control
7 4 case
8 4 control
9 5 case
10 6 control
I would like to keep only the matched "pairs" of subjects (those that share the same matching_variable) that contain exactly 1 case and 1 control, such as the pairs with matching_variable = 1 or matching_variable = 4.
So I would like to remove the matched subjects whose pair contains only cases (such as matching_variable = 2) or only controls (such as matching_variable = 3), as well as the subjects that are alone because they have not been matched (such as the last 2 subjects).
The expected result would be this:
ID matching_variable status
1 1 case
2 1 control
7 4 case
8 4 control
I'm sure it's not too complicated but I have no idea how to go about it...
Thanks in advance for the help
CodePudding user response:
One idea using base R:
df[as.logical(with(df, ave(status, matching_variable, FUN = function(i)length(unique(i)) > 1))),]
  ID matching_variable  status
1  1                 1    case
2  2                 1 control
7  7                 4    case
8  8                 4 control
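A variant of the same base R idea, as a rough sketch. `ave()` returns a vector of the same type as its first argument, so passing the character column `status` stores the result as the strings "TRUE"/"FALSE" (hence the `as.logical()` above). Passing a logical vector instead avoids that conversion. The sketch below also requires exactly one case and one control per group, which is stricter than the one-liner above. The names `is_case`, `keep` and `result` are only illustrative, and it assumes `status` contains no values other than "case" and "control".

```r
# Example data copied from the question
df <- data.frame(
  ID = 1:10,
  matching_variable = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 6),
  status = c("case", "control", "case", "case", "control",
             "control", "case", "control", "case", "control")
)

# Logical helper vector: TRUE for cases, FALSE for controls
# (assumption: status only ever holds "case" or "control")
is_case <- df$status == "case"

# Per group: TRUE only when the group has exactly 2 rows and 1 of them is a case.
# The single TRUE/FALSE per group is recycled to every row of that group.
keep <- ave(is_case, df$matching_variable,
            FUN = function(x) length(x) == 2 && sum(x) == 1)

result <- df[keep, ]
result
```

Because `keep` is already logical, it can be used directly as a row index.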
CodePudding user response:
Try this approach:
library(tidyverse)
tribble(
  ~ID, ~matching_variable, ~status,
  1, 1, "case",
  2, 1, "control",
  3, 2, "case",
  4, 2, "case",
  5, 3, "control",
  6, 3, "control",
  7, 4, "case",
  8, 4, "control",
  9, 5, "case",
  10, 6, "control"
) |>
  group_by(matching_variable) |>
  filter(first(status) != last(status))
#> # A tibble: 4 × 3
#> # Groups:   matching_variable [2]
#>      ID matching_variable status
#>   <dbl>             <dbl> <chr>
#> 1     1                 1 case
#> 2     2                 1 control
#> 3     7                 4 case
#> 4     8                 4 control
Created on 2022-04-28 by the reprex package (v2.0.1)
CodePudding user response:
Using the tidyverse...
library(dplyr)
df %>%
  group_by(matching_variable) %>%
  filter(length(unique(status)) == 2 & length(status) == 2)
     ID matching_variable status
  <int>             <int> <chr>
1     1                 1 case
2     2                 1 control
3     7                 4 case
4     8                 4 control
The filter keeps a group only if it contains exactly two distinct statuses and exactly two rows in total. The second condition rejects groups such as two cases and one control, which I think you would want to exclude.
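To illustrate why the row-count condition matters, here is a small sketch on made-up data (not from the question): `df_extra` is a hypothetical data frame in which matching_variable 7 has two cases and one control. `n_distinct()` and `n()` are used here as the dplyr equivalents of `length(unique(status))` and `length(status)`.

```r
library(dplyr)

# Hypothetical data: group 1 is a proper pair, group 7 has 2 cases + 1 control
df_extra <- data.frame(
  ID = 1:5,
  matching_variable = c(1, 1, 7, 7, 7),
  status = c("case", "control", "case", "case", "control")
)

# Only checks that both statuses are present: group 7 is kept
loose <- df_extra %>%
  group_by(matching_variable) %>%
  filter(n_distinct(status) == 2) %>%
  ungroup()

# Also requires exactly two rows per group: group 7 is dropped
strict <- df_extra %>%
  group_by(matching_variable) %>%
  filter(n_distinct(status) == 2, n() == 2) %>%
  ungroup()

nrow(loose)
strict$ID
```

With this data, `loose` keeps all five rows while `strict` keeps only IDs 1 and 2.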
CodePudding user response:
Another possible solution:
library(tidyverse)
df %>%
  group_by(matching_variable) %>%
  mutate(n = n_distinct(status)) %>%
  ungroup() %>%
  filter(n > 1) %>%
  select(-n)
#> # A tibble: 4 × 3
#>      ID matching_variable status
#>   <int>             <int> <chr>
#> 1     1                 1 case
#> 2     2                 1 control
#> 3     7                 4 case
#> 4     8                 4 control
CodePudding user response:
Here is one more:
library(dplyr)
df %>%
  group_by(matching_variable) %>%
  filter(n() == 2 &
           !duplicated(status) &
           !duplicated(status, fromLast = TRUE))
     ID matching_variable status
  <int>             <int> <chr>
1     1                 1 case
2     2                 1 control
3     7                 4 case
4     8                 4 control