Say I have a data.frame:
file = read.table(text = "sex age num
M 32 5
F 31 2
M 91 2
M 30 1
M 23 1
F 19 1
F 31 2
F 21 2
M 32 5
F 65 3
M 24 5", header = T, sep = "")
I want to get a sorted data frame of all rows that have the exact same values of sex
, age
, and num
with any other row in the data frame.
The result should look like this (note that the data frame is sorted by the pairs or groups that are duplicated with each other):
result = read.table(text = "sex age num
M 32 5
M 32 5
F 31 2
F 31 2", header = T, sep = "")
I have tried various combinations of distinct
in dplyr and duplicated
, but they don't quite get at this use case.
CodePudding user response:
We need duplicated
twice i.e. one duplicated
in the normal direction from up to bottom and second from bottom to top (fromLast = TRUE
) and then use |
so that it can be TRUE in either direction for subsetting
out <- file[duplicated(file)|duplicated(file, fromLast = TRUE),]
out$sex <- factor(out$sex, levels = c("M", "F"))
out1 <- out[do.call(order, out),]
row.names(out1) <- NULL
-output
> out1
sex age num
1 M 32 5
2 M 32 5
3 F 31 2
4 F 31 2
The above can be written in tidyverse
library(dplyr)
file %>%
arrange(sex == "F", across(everything())) %>%
filter(duplicated(.)|duplicated(., fromLast = TRUE))
sex age num
1 M 32 5
2 M 32 5
3 F 31 2
4 F 31 2
CodePudding user response:
An alternative approach:
Here all groups with more then 1 nrow will be kept:
library(dplyr)
file %>%
group_by(sex, age, num) %>%
filter(n() > 1) %>%
arrange(.by_group = T)
ungroup()
sex age num
<chr> <int> <int>
1 F 31 2
2 F 31 2
3 M 32 5
4 M 32 5
CodePudding user response:
file = read.table(text = "sex age num
M 32 5
F 31 2
M 91 2
M 30 1
M 23 1
F 19 1
F 31 2
F 21 2
M 32 5
F 65 3
M 24 5", header = T, sep = "")
library(vctrs)
library(dplyr, warn = F)
#> Warning: package 'dplyr' was built under R version 4.1.2
file %>%
filter(vec_duplicate_detect(.)) %>%
arrange(across(everything()))
#> sex age num
#> 1 F 31 2
#> 2 F 31 2
#> 3 M 32 5
#> 4 M 32 5
Created on 2022-08-19 by the reprex package (v2.0.1.9000)
CodePudding user response:
A base R option using subset
ave
> subset(file, ave(seq_along(num), sex, age, num, FUN = length) > 1)
sex age num
1 M 32 5
2 F 31 2
7 F 31 2
9 M 32 5
or rbind
split
> do.call(rbind, Filter(function(x) nrow(x) > 1, split(file, ~ sex age num)))
sex age num
F.31.2.2 F 31 2
F.31.2.7 F 31 2
M.32.5.1 M 32 5
M.32.5.9 M 32 5
CodePudding user response:
Here is an approach, using .SD[.N>1]
by group in data.table
library(data.table)
result = setDT(file)[, i:=.I][, .SD[.N>1],.(sex,age,num)][, i:=NULL]
Output:
sex age num
1: M 32 5
2: M 32 5
3: F 31 2
4: F 31 2