Delete subjects that are not matched (R)

I have a data frame (df) like this one:

ID  matching_variable   status
 1     1                 case
 2     1                 control
 3     2                 case
 4     2                 case
 5     3                 control
 6     3                 control
 7     4                 case
 8     4                 control
 9     5                 case
10     6                 control

I would like to keep all the "pairs" of matched subjects (subjects that share the same matching variable) that contain exactly 1 case and 1 control, such as the pairs with matching_variable = 1 or matching_variable = 4.

So, I would like to remove the matched subjects for which there are only cases (such as matching_variable = 2) or only controls (such as matching_variable = 3), as well as the subjects that are alone because they have not been matched (such as the last 2 subjects).

The expected result would be this:

 ID matching_variable   status
  1         1           case
  2         1           control
  7         4           case
  8         4           control

I'm sure it's not too complicated but I have no idea how to go about it...

Thanks in advance for the help

CodePudding user response：

One idea using base R (this relies on status being a character column rather than a factor):

df[as.logical(with(df, ave(status, matching_variable, FUN = function(i) length(unique(i)) > 1))), ]

  ID matching_variable  status
1  1                 1    case
2  2                 1 control
7  7                 4    case
8  8                 4 control
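
The one-liner above keeps any group with more than one distinct status, whatever its size. As a rough sketch of a stricter base R variant, the code below rebuilds the example data from the question and counts cases and controls per group. The names n_case, n_control and matched are made up for illustration, and it assumes the only status values are "case" and "control".

```r
# Rebuild the example data from the question (status stored as character)
df <- data.frame(
  ID = 1:10,
  matching_variable = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 6),
  status = c("case", "control", "case", "case", "control",
             "control", "case", "control", "case", "control"),
  stringsAsFactors = FALSE
)

# Count cases and controls within each matching group;
# ave() returns the group total repeated for every row of the group
n_case    <- ave(as.numeric(df$status == "case"),    df$matching_variable, FUN = sum)
n_control <- ave(as.numeric(df$status == "control"), df$matching_variable, FUN = sum)

# Keep only the rows whose group has exactly one case and one control
matched <- df[n_case == 1 & n_control == 1, ]
matched
```

On the example data this should keep the rows with ID 1, 2, 7 and 8, the same as the expected result in the question.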

CodePudding user response：

Try this approach:

library(tidyverse)

tribble(
  ~ID, ~matching_variable, ~status,
  1, 1, "case",
  2, 1, "control",
  3, 2, "case",
  4, 2, "case",
  5, 3, "control",
  6, 3, "control",
  7, 4, "case",
  8, 4, "control",
  9, 5, "case",
  10, 6, "control"
) |> 
  group_by(matching_variable) |> 
  filter(first(status) != last(status))
#> # A tibble: 4 × 3
#> # Groups:   matching_variable [2]
#>      ID matching_variable status 
#>   <dbl>             <dbl> <chr>  
#> 1     1                 1 case   
#> 2     2                 1 control
#> 3     7                 4 case   
#> 4     8                 4 control

Created on 2022-04-28 by the reprex package (v2.0.1)

CodePudding user response：

Using the tidyverse...

library(dplyr)

df %>% 
    group_by(matching_variable) %>% 
    filter(length(unique(status)) == 2 & length(status) == 2)

     ID matching_variable status 
  <int>             <int> <chr>  
1     1                 1 case   
2     2                 1 control
3     7                 4 case   
4     8                 4 control

The filter keeps a group only if it contains exactly two distinct status values and exactly two rows in total. The second condition rejects groups such as two cases and one control, which I think you would want to drop.
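
To see why the group-size check matters, here is a small sketch with made-up data. The extra data frame and its group 7 are hypothetical and not part of the question. It compares a filter that only checks for distinct statuses with one that also checks the group size.

```r
library(dplyr)

# Hypothetical data: group 1 is a proper pair, group 7 has two cases and one control
extra <- data.frame(
  ID = 1:5,
  matching_variable = c(1, 1, 7, 7, 7),
  status = c("case", "control", "case", "case", "control")
)

# Checking only for more than one distinct status keeps group 7
loose <- extra %>%
  group_by(matching_variable) %>%
  filter(n_distinct(status) > 1) %>%
  ungroup()

# Also requiring exactly two rows per group drops group 7
strict <- extra %>%
  group_by(matching_variable) %>%
  filter(n_distinct(status) == 2, n() == 2) %>%
  ungroup()

nrow(loose)   # 5
nrow(strict)  # 2
```

On the data in the question both filters give the same four rows, since no group there has more than two subjects.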

CodePudding user response：

Another possible solution:

library(tidyverse)

df %>% 
  group_by(matching_variable) %>% 
  mutate(n = n_distinct(status)) %>% 
  ungroup() %>% 
  filter(n > 1) %>% 
  select(-n)

#> # A tibble: 4 × 3
#>      ID matching_variable status 
#>   <int>             <int> <chr>  
#> 1     1                 1 case   
#> 2     2                 1 control
#> 3     7                 4 case   
#> 4     8                 4 control
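
As a side note, the helper column n is not strictly needed, because filter() evaluates its condition within each group. A minimal sketch of the same idea is below. The df definition is rebuilt from the question's table, and result is just an illustrative name.

```r
library(dplyr)

# Example data rebuilt from the question
df <- data.frame(
  ID = 1:10,
  matching_variable = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 6),
  status = c("case", "control", "case", "case", "control",
             "control", "case", "control", "case", "control")
)

# n_distinct(status) is computed per group, so no temporary column is required
result <- df %>%
  group_by(matching_variable) %>%
  filter(n_distinct(status) > 1) %>%
  ungroup()

result
```

Like the version above, this keeps any group with both statuses present, regardless of how many subjects it contains.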

CodePudding user response：

Here is one more:

library(dplyr)

df %>%
   group_by(matching_variable) %>%
   filter(n() == 2 & 
            !duplicated(status) &
            (!duplicated(status, fromLast = TRUE)))

     ID matching_variable status 
  <int>             <int> <chr>  
1     1                 1 case   
2     2                 1 control
3     7                 4 case   
4     8                 4 control