Home > Enterprise >  Identify pairs or groups of rows that have the same values across multiple columns
Identify pairs or groups of rows that have the same values across multiple columns

Time:08-20

Say I have a data.frame:

file = read.table(text = "sex age num          
M 32 5
F 31 2
M 91 2
M 30 1
M 23 1
F 19 1
F 31 2
F 21 2
M 32 5
F 65 3
M 24 5", header = T, sep = "")

I want to get a sorted data frame of all rows that have the exact same values of sex, age, and num with any other row in the data frame.

The result should look like this (note that the data frame is sorted by the pairs or groups that are duplicated with each other):

result = read.table(text = "sex age num          
M 32 5
M 32 5
F 31 2
F 31 2", header = T, sep = "")

I have tried various combinations of distinct in dplyr and duplicated, but they don't quite get at this use case.

CodePudding user response:

We need duplicated twice i.e. one duplicated in the normal direction from up to bottom and second from bottom to top (fromLast = TRUE) and then use | so that it can be TRUE in either direction for subsetting

out <- file[duplicated(file)|duplicated(file, fromLast = TRUE),]
out$sex <- factor(out$sex, levels = c("M", "F"))
out1 <- out[do.call(order, out),]
row.names(out1) <- NULL

-output

> out1
  sex age num
1   M  32   5
2   M  32   5
3   F  31   2
4   F  31   2

The above can be written in tidyverse

library(dplyr)
file %>%
   arrange(sex == "F", across(everything())) %>% 
   filter(duplicated(.)|duplicated(., fromLast = TRUE))
  sex age num
1   M  32   5
2   M  32   5
3   F  31   2
4   F  31   2

CodePudding user response:

An alternative approach:

Here all groups with more then 1 nrow will be kept:

library(dplyr)

file %>% 
  group_by(sex, age, num) %>% 
  filter(n() > 1) %>%
  arrange(.by_group = T)
  ungroup()
 sex     age   num
  <chr> <int> <int>
1 F        31     2
2 F        31     2
3 M        32     5
4 M        32     5

CodePudding user response:

file = read.table(text = "sex age num          
M 32 5
F 31 2
M 91 2
M 30 1
M 23 1
F 19 1
F 31 2
F 21 2
M 32 5
F 65 3
M 24 5", header = T, sep = "")

library(vctrs)
library(dplyr, warn = F)
#> Warning: package 'dplyr' was built under R version 4.1.2

file %>% 
  filter(vec_duplicate_detect(.)) %>% 
  arrange(across(everything()))
#>   sex age num
#> 1   F  31   2
#> 2   F  31   2
#> 3   M  32   5
#> 4   M  32   5

Created on 2022-08-19 by the reprex package (v2.0.1.9000)

CodePudding user response:

A base R option using subset ave

> subset(file, ave(seq_along(num), sex, age, num, FUN = length) > 1)
  sex age num
1   M  32   5
2   F  31   2
7   F  31   2
9   M  32   5

or rbind split

> do.call(rbind, Filter(function(x) nrow(x) > 1, split(file, ~ sex   age   num)))
         sex age num
F.31.2.2   F  31   2
F.31.2.7   F  31   2
M.32.5.1   M  32   5
M.32.5.9   M  32   5

CodePudding user response:

Here is an approach, using .SD[.N>1] by group in data.table

library(data.table)
result = setDT(file)[, i:=.I][, .SD[.N>1],.(sex,age,num)][, i:=NULL]

Output:

   sex age num
1:   M  32   5
2:   M  32   5
3:   F  31   2
4:   F  31   2
  • Related