Home > Software design >  Filter unique combination of two columns
Filter unique combination of two columns

Time:12-14

Sample data:

structure(list(name_1 = c("Kevin", "Tom", "Laura", "Julie"), 
    name_2 = c("Tom", "Kevin", "Julie", "Laura"), value = c(10, 
    10, 20, 20)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-4L))

# A tibble: 4 × 3
  name_1 name_2 value
  <chr>  <chr>  <dbl>
1 Kevin  Tom       10
2 Tom    Kevin     10
3 Laura  Julie     20
4 Julie  Laura     20

How can I filter unique combination of the columns name_1 and name_2 without subsetting every other row? A tidyverse/dplyr method is prefered.

The expected output is as follows:

# A tibble: 2 × 3
  name_1 name_2 value
  <chr>  <chr>  <dbl>
1 Kevin  Tom       10
2 Laura  Julie     20

CodePudding user response:

We can sort the names (by row, using pmin/pmax) and then distinct based on the sorted values. In this case, I removed the sorted columns but you can choose better names and drop the originals if you prefer.

library(dplyr)
quux %>%
  mutate(a=pmin(name_1, name_2), b=pmax(name_1, name_2)) %>%
  distinct(a, b, .keep_all = TRUE) %>%
  select(-a, -b)
# # A tibble: 2 x 3
#   name_1 name_2 value
#   <chr>  <chr>  <dbl>
# 1 Kevin  Tom       10
# 2 Laura  Julie     20

Data

quux <- structure(list(name_1 = c("Kevin", "Tom", "Laura", "Julie"), name_2 = c("Tom", "Kevin", "Julie", "Laura"), value = c(10, 10, 20, 20)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L))

CodePudding user response:

Are you looking for something like this?

df[which(duplicated(apply(df[-3], 1, function(i) toString(sort(i))))),]

# A tibble: 2 × 3
  name_1 name_2 value
  <chr>  <chr>  <dbl>
1 Tom    Kevin     10
2 Julie  Laura     20
  • Related