How to create a new dataframe of partially duplicate rows (duplicates based on 4 columns out of 45 columns)


I have a large dataset, and about 10% of it is "double coded": a research assistant re-collected data for a portion of the records so we can make sure they are accurate. Mostly, I want to check for spelling errors and other discrepancies.

I just want to pull the double-coded rows out into a new data frame so I can read through them to make sure they match up, then remove the duplicate rows.

I can identify the duplicate data based on 4 ID columns (Link, BillType, BillNumber, Name). I know how to identify duplicate rows and remove duplicates based on a certain number of columns, but how could I make a dataset of the duplicates?

This is how I can drop the duplicate rows:

library(dplyr)

FullData <- FullData %>%
  distinct(Link, BillType, BillNumber, Name, .keep_all = TRUE)
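
For illustration only (the values below are made up, not from my data), this is what that distinct() call keeps on a tiny example where row 3 double-codes row 1 on the four ID columns:

library(dplyr)

# Toy data: row 3 repeats row 1's ID columns but was re-coded separately
toy <- data.frame(
  Link       = c("a", "b", "a"),
  BillType   = c("HB", "SB", "HB"),
  BillNumber = c(1, 2, 1),
  Name       = c("Smith", "Jones", "Smith"),
  Notes      = c("clean", "clean", "re-coded, possible typo")
)

toy %>%
  distinct(Link, BillType, BillNumber, Name, .keep_all = TRUE)
# keeps rows 1 and 2; the re-coded row 3 is dropped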

CodePudding user response:

We can use dplyr::anti_join(): anti-join the full data against its distinct() version to recover the rows that distinct() drops.

library(dplyr)

# Rows of FullData that have no exact match in the distinct() result,
# i.e. the extra copies that distinct() would drop
FullData %>% 
    anti_join(FullData %>% 
                  distinct(Link,
                           BillType,
                           BillNumber,
                           Name,
                           .keep_all = TRUE))
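
A quick check on made-up data (not from the post) shows that the anti-join returns exactly the copy that distinct() drops. One caveat: a re-coded row that is identical in every column would not show up, because the join uses all common columns.

library(dplyr)

# Toy double-coding: id 1 was coded twice, the second time with a different note
toy <- data.frame(id = c(1, 2, 1),
                  note = c("original", "original", "re-coded"))

toy %>% 
    anti_join(toy %>% distinct(id, .keep_all = TRUE),
              by = c("id", "note"))
#   id     note
# 1  1 re-coded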

CodePudding user response:

One way is to group by the key variables and use group_rows() as a helper to get the row indices of each group; the duplicates are the groups with more than one row (lengths(grps) > 1).

Example

df
  one two three four five six
1   5   5     6    2    6   6
2   4  10     1   10    8   9
3   2   7     6    2    6   9

Choose columns three, four and five to look for duplicates.

library(dplyr)

grps <- df %>% 
  group_by(across(three:five)) %>% 
  group_rows

df[unlist(grps[lengths(grps) > 1]), ]
  one two three four five six
1   5   5     6    2    6   6
3   2   7     6    2    6   9

Data

df <- structure(list(one = c(5, 4, 2), two = c(5, 10, 7), three = c(6, 
1, 6), four = c(2, 10, 2), five = c(6, 8, 6), six = c(6, 9, 9
)), class = "data.frame", row.names = c(NA, -3L))
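
Applied to the question's data, the same idea would look like this (a sketch using the four ID columns named in the question; DoubleCoded is just an illustrative name):

library(dplyr)

grps <- FullData %>% 
  group_by(Link, BillType, BillNumber, Name) %>% 
  group_rows()

# Every row whose ID combination occurs more than once, i.e. both copies
# of each double-coded record, so they can be compared side by side
DoubleCoded <- FullData[unlist(grps[lengths(grps) > 1]), ]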