Dropping cases where the values of two columns are same-CodePudding

I would like to ask for your help with figuring out how to code dropping those instances where the values of two columns are the same within a cluster/group.

Example data:

structure(list(dyad_id = structure(c(1L, 1L, 2L, 2L, 2L), .Label = c("111_222", 
"111_333"), class = "factor"), alter = c(111L, 111L, 111L, 333L, 
111L), ego = c(222L, 222L, 333L, 111L, 333L), score = c(3L, 2L, 
4L, 4L, 3L)), class = "data.frame", row.names = c(NA, -5L))

What the example data looks like:

 --------- ------- ----- ------- 
| dyad_id | alter | ego | score |
 --------- ------- ----- ------- 
| 111_222 | 111   | 222 | 3     |
 --------- ------- ----- ------- 
| 111_222 | 111   | 222 | 2     |
 --------- ------- ----- ------- 
| 111_333 | 111   | 333 | 4     |
 --------- ------- ----- ------- 
| 111_333 | 333   | 111 | 4     |
 --------- ------- ----- ------- 
| 111_333 | 111   | 333 | 3     |
 --------- ------- ----- -------

What I'm looking for:

I'm hoping to drop/exclude observations of dyad_id 111_222 because alter and ego are the same within the dyad. On the other hand, I would keep dyad 111_333 because there is at least one case where alter and ego are switched.

The output of what I'm looking for:

 --------- ------- ----- ------- 
| dyad_id | alter | ego | score |
 --------- ------- ----- ------- 
| 111_333 | 111   | 333 | 4     |
 --------- ------- ----- ------- 
| 111_333 | 333   | 111 | 4     |
 --------- ------- ----- ------- 
| 111_333 | 111   | 333 | 3     |
 --------- ------- ----- -------

CodePudding user response：

Similar to @langtang 's data.table solution but in dplyr -

library(dplyr)

df %>%
  group_by(dyad_id) %>%
  filter(n_distinct(paste(alter, ego)) > 1) %>%
  ungroup

#  dyad_id alter   ego score
#  <fct>   <int> <int> <int>
#1 111_333   111   333     4
#2 111_333   333   111     4
#3 111_333   111   333     3

CodePudding user response：

I think what you're looking for is distinct(), as in:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df <- tribble(
   ~dyad_id, ~alter, ~ego, ~score,
  "111_222",   111L, 222L,     3L,
  "111_222",   111L, 222L,     2L,
  "111_333",   111L, 333L,     4L,
  "111_333",   333L, 111L,     4L,
  "111_333",   111L, 333L,     3L
  )

df %>%
  distinct(dyad_id, alter, ego)
#> # A tibble: 3 × 3
#>   dyad_id alter   ego
#>   <chr>   <int> <int>
#> 1 111_222   111   222
#> 2 111_333   111   333
#> 3 111_333   333   111

^{Created on 2022-03-15 by the reprex package (v2.0.1)}

This drops the score column - unclear if you want to do that. Another way is to just retain the first row from each group, as in:


df %>%
  group_by(dyad_id, alter, ego) %>%
  slice(1)
#> # A tibble: 3 × 4
#> # Groups:   dyad_id, alter, ego [3]
#>   dyad_id alter   ego score
#>   <chr>   <int> <int> <int>
#> 1 111_222   111   222     3
#> 2 111_333   111   333     4
#> 3 111_333   333   111     4

CodePudding user response：

There is a probably a more efficient way to do this but the below works:

Basically: 1.create a column that extracts out the first 3 digits and do logic test 2. create column to extract last 3 digits and do logic test 3.group_by the grouping variable 4.sum the logic outputs, TRUE equals 1 5.then filter across the columns if both conditions are 1 6. then do anti join of this against original column

df_logic <- df %>% mutate(alter_logic=as.numeric(
    str_extract(dyad_id,"^[:digit:]{3}"))!=alter,
    ego_logic=as.numeric(
      str_extract(dyad_id,"[:digit:]*${3}"))!=ego) %>% 
  group_by(dyad_id) %>% 
  summarize(across(contains("logic"),sum)) %>% 
  filter(if_all(contains("logic"),~.x>0))

df_filter <- anti_join(x=df,y=logic, by="dyad_id")

CodePudding user response：

You can try this:

inner_join(
  df,
  df %>%
    distinct(dyad_id, alter, ego) %>%
    count(dyad_id) %>%
    filter(n>1)
  ) %>% select(-n)


  dyad_id alter ego score
1 111_333   111 333     4
2 111_333   333 111     4
3 111_333   111 333     3

Here is a data.table solution

library(data.table)

setDT(df)[, N:=uniqueN(paste(alter,ego)), by=dyad_id][N>1][,N:=NULL]

   dyad_id alter   ego score
    <fctr> <int> <int> <int>
1: 111_333   111   333     4
2: 111_333   333   111     4
3: 111_333   111   333     3