I would like to ask for your help with figuring out how to code dropping those instances where the values of two columns are the same within a cluster/group.
Example data:
structure(list(dyad_id = structure(c(1L, 1L, 2L, 2L, 2L), .Label = c("111_222",
"111_333"), class = "factor"), alter = c(111L, 111L, 111L, 333L,
111L), ego = c(222L, 222L, 333L, 111L, 333L), score = c(3L, 2L,
4L, 4L, 3L)), class = "data.frame", row.names = c(NA, -5L))
What the example data looks like:
--------- ------- ----- -------
| dyad_id | alter | ego | score |
--------- ------- ----- -------
| 111_222 | 111 | 222 | 3 |
--------- ------- ----- -------
| 111_222 | 111 | 222 | 2 |
--------- ------- ----- -------
| 111_333 | 111 | 333 | 4 |
--------- ------- ----- -------
| 111_333 | 333 | 111 | 4 |
--------- ------- ----- -------
| 111_333 | 111 | 333 | 3 |
--------- ------- ----- -------
What I'm looking for:
I'm hoping to drop/exclude observations of dyad_id 111_222 because alter and ego are the same within the dyad. On the other hand, I would keep dyad 111_333 because there is at least one case where alter and ego are switched.
The output of what I'm looking for:
--------- ------- ----- -------
| dyad_id | alter | ego | score |
--------- ------- ----- -------
| 111_333 | 111 | 333 | 4 |
--------- ------- ----- -------
| 111_333 | 333 | 111 | 4 |
--------- ------- ----- -------
| 111_333 | 111 | 333 | 3 |
--------- ------- ----- -------
CodePudding user response:
Similar to @langtang 's data.table
solution but in dplyr
-
library(dplyr)
df %>%
group_by(dyad_id) %>%
filter(n_distinct(paste(alter, ego)) > 1) %>%
ungroup
# dyad_id alter ego score
# <fct> <int> <int> <int>
#1 111_333 111 333 4
#2 111_333 333 111 4
#3 111_333 111 333 3
CodePudding user response:
I think what you're looking for is distinct()
, as in:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- tribble(
~dyad_id, ~alter, ~ego, ~score,
"111_222", 111L, 222L, 3L,
"111_222", 111L, 222L, 2L,
"111_333", 111L, 333L, 4L,
"111_333", 333L, 111L, 4L,
"111_333", 111L, 333L, 3L
)
df %>%
distinct(dyad_id, alter, ego)
#> # A tibble: 3 × 3
#> dyad_id alter ego
#> <chr> <int> <int>
#> 1 111_222 111 222
#> 2 111_333 111 333
#> 3 111_333 333 111
Created on 2022-03-15 by the reprex package (v2.0.1)
This drops the score
column - unclear if you want to do that. Another way is to just retain the first row from each group, as in:
df %>%
group_by(dyad_id, alter, ego) %>%
slice(1)
#> # A tibble: 3 × 4
#> # Groups: dyad_id, alter, ego [3]
#> dyad_id alter ego score
#> <chr> <int> <int> <int>
#> 1 111_222 111 222 3
#> 2 111_333 111 333 4
#> 3 111_333 333 111 4
CodePudding user response:
There is a probably a more efficient way to do this but the below works:
Basically: 1.create a column that extracts out the first 3 digits and do logic test 2. create column to extract last 3 digits and do logic test 3.group_by the grouping variable 4.sum the logic outputs, TRUE equals 1 5.then filter across the columns if both conditions are 1 6. then do anti join of this against original column
df_logic <- df %>% mutate(alter_logic=as.numeric(
str_extract(dyad_id,"^[:digit:]{3}"))!=alter,
ego_logic=as.numeric(
str_extract(dyad_id,"[:digit:]*${3}"))!=ego) %>%
group_by(dyad_id) %>%
summarize(across(contains("logic"),sum)) %>%
filter(if_all(contains("logic"),~.x>0))
df_filter <- anti_join(x=df,y=logic, by="dyad_id")
CodePudding user response:
You can try this:
inner_join(
df,
df %>%
distinct(dyad_id, alter, ego) %>%
count(dyad_id) %>%
filter(n>1)
) %>% select(-n)
dyad_id alter ego score
1 111_333 111 333 4
2 111_333 333 111 4
3 111_333 111 333 3
Here is a data.table solution
library(data.table)
setDT(df)[, N:=uniqueN(paste(alter,ego)), by=dyad_id][N>1][,N:=NULL]
dyad_id alter ego score
<fctr> <int> <int> <int>
1: 111_333 111 333 4
2: 111_333 333 111 4
3: 111_333 111 333 3