I have a large dataset, and about 10% of it is "double coded": a research assistant re-collected a portion of the data so we can check its accuracy. Mostly, I want to check for spelling errors and other discrepancies.
I just want to pull the double-coded rows out into a new data frame so I can read through them and make sure they match up, then remove the duplicate rows.
I can identify the duplicate data based on 4 ID columns (Link, BillType, BillNumber, Name). I know how to identify duplicate rows and remove duplicates based on a certain set of columns, but how could I make a dataset of just the duplicates?
This is how I can drop the duplicate rows:
FullData <- FullData %>%
  distinct(Link, BillType, BillNumber, Name, .keep_all = TRUE)
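(To be explicit about what that does: distinct() with .keep_all = TRUE keeps only the first row of each ID combination and silently drops the rest. A minimal toy sketch, with made-up columns standing in for my real ones:)

library(dplyr)

# Toy data (made-up columns): the second "a" row is a re-coded copy
# whose Title contains a spelling discrepancy.
toy <- tibble(
  Link  = c("a", "a", "b"),
  Title = c("Budget Act", "Budjet Act", "Water Act")
)

toy %>% distinct(Link, .keep_all = TRUE)
# Keeps the first "a" row and the "b" row; the re-coded "a" row is dropped.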
CodePudding user response:
We can use dplyr::anti_join to pull out the rows that distinct() would drop.
library(dplyr)

# Keep the first-coded row for each ID combination, then return the rows
# of FullData that don't exactly match any kept row, i.e. the re-coded
# copies that differ in at least one column.
FullData %>%
  anti_join(FullData %>%
              distinct(Link,
                       BillType,
                       BillNumber,
                       Name,
                       .keep_all = TRUE))
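If you would rather see both copies of each double-coded record side by side while reading through, a grouped filter on the same four ID columns is another option (a sketch, not part of the answer above; DoubleCoded is just a hypothetical name for the new data frame):

library(dplyr)

# Both rows of every ID combination that appears more than once,
# arranged so each pair sits together for comparison.
DoubleCoded <- FullData %>%
  group_by(Link, BillType, BillNumber, Name) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  arrange(Link, BillType, BillNumber, Name)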
CodePudding user response:
One way is to group by the relevant variables and use group_rows() as a helper to access the duplicates (the groups where lengths(grps) > 1).
Example
df
  one two three four five six
1   5   5     6    2    6   6
2   4  10     1   10    8   9
3   2   7     6    2    6   9
Choose columns three, four and five to look for duplicates.
library(dplyr)

grps <- df %>%
  group_by(across(three:five)) %>%
  group_rows()

# Groups holding more than one row index are the duplicates
df[unlist(grps[lengths(grps) > 1]), ]
  one two three four five six
1   5   5     6    2    6   6
3   2   7     6    2    6   9
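Applied to the question's data, the same pattern would look roughly like this (a sketch assuming FullData and the four ID columns from the question; DoubleCoded is a hypothetical name for the new data frame):

library(dplyr)

grps <- FullData %>%
  group_by(Link, BillType, BillNumber, Name) %>%
  group_rows()

# Every row whose ID combination was coded more than once
DoubleCoded <- FullData[unlist(grps[lengths(grps) > 1]), ]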
Data
df <- structure(list(one = c(5, 4, 2), two = c(5, 10, 7), three = c(6,
1, 6), four = c(2, 10, 2), five = c(6, 8, 6), six = c(6, 9, 9
)), class = "data.frame", row.names = c(NA, -3L))