Add count of duplicate observations to data frame

Time:12-02

I am trying to get all duplicated observations. I have been looking, but all the solutions I found seem to work on individual columns. Is it possible to get the entire rows?

My dataset looks like this:

structure(list(CrimeId = c(160903280L, 160912272L, 160912590L, 
160912801L, 160912811L, 160913003L), OriginalCrimeTypeName = c("Assault / Battery", 
"Homeless Complaint", "Susp Info", "Report", "594", "Ref'd"), 
    OffenseDate = c("2016-03-30T00:00:00", "2016-03-31T00:00:00", 
    "2016-03-31T00:00:00", "2016-03-31T00:00:00", "2016-03-31T00:00:00", 
    "2016-03-31T00:00:00"), CallTime = c("18:42", "15:31", "16:49", 
    "17:38", "17:42", "18:29"), CallDateTime = c("2016-03-30T18:42:00", 
    "2016-03-31T15:31:00", "2016-03-31T16:49:00", "2016-03-31T17:38:00", 
    "2016-03-31T17:42:00", "2016-03-31T18:29:00"), Disposition = c("REP", 
    "GOA", "GOA", "GOA", "REP", "GOA"), Address = c("100 Block Of Chilton Av", 
    "2300 Block Of Market St", "2300 Block Of Market St", "500 Block Of 7th St", 
    "Beale St/bryant St", "16th St/pond St"), City = c("San Francisco", 
    "San Francisco", "San Francisco", "San Francisco", "San Francisco", 
    "San Francisco"), State = c("CA", "CA", "CA", "CA", "CA", 
    "CA"), AgencyId = c("1", "1", "1", "1", "1", "1"), Range = c(NA, 
    NA, NA, NA, NA, NA), AddressType = c("Premise Address", "Premise Address", 
    "Premise Address", "Premise Address", "Intersection", "Intersection"
    )), row.names = c(NA, 6L), class = "data.frame")

CodePudding user response:

With dplyr, try group_by_all() or its now-recommended equivalent, group_by(across(everything())). Below I use a slightly extended data set where I created a duplicated entry (rows 2 and 5).

library(dplyr)

df %>% 
  group_by(across(everything())) %>% 
  mutate(dup = n())
...AgencyId Range AddressType       dup
...  <chr>    <lgl> <chr>           <int>
...1 1        NA    Premise Address     1
...2 1        NA    Premise Address     2
...3 1        NA    Premise Address     1
...4 1        NA    Premise Address     1
...5 1        NA    Premise Address     2
...6 1        NA    Intersection        1
...7 1        NA    Intersection        1

(only showing the last 4 columns)
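If you only want the duplicated rows themselves rather than a count on every row, you can filter on the new column afterwards. A small sketch with made-up toy data (the column names id and type are just for illustration):

```r
library(dplyr)

# toy data: rows 1 and 3 are identical
df <- data.frame(id = c(1, 2, 1), type = c("a", "b", "a"))

df %>%
  group_by(across(everything())) %>%  # group by every column, i.e. whole rows
  mutate(dup = n()) %>%               # per-group row count
  ungroup() %>%
  filter(dup > 1)                     # keep only rows that occur more than once
```

This returns the two identical rows, each carrying dup = 2.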

Extended data:

df <- structure(list(CrimeId = c(160903280L, 160912272L, 160912590L,
160912801L, 160912272L, 160912811L, 160913003L), OriginalCrimeTypeName = c("Assault / Battery",
"Homeless Complaint", "Susp Info", "Report", "Homeless Complaint",
"594", "Ref'd"), OffenseDate = c("2016-03-30T00:00:00", "2016-03-31T00:00:00",
"2016-03-31T00:00:00", "2016-03-31T00:00:00", "2016-03-31T00:00:00",
"2016-03-31T00:00:00", "2016-03-31T00:00:00"), CallTime = c("18:42",
"15:31", "16:49", "17:38", "15:31", "17:42", "18:29"), CallDateTime = c("2016-03-30T18:42:00",
"2016-03-31T15:31:00", "2016-03-31T16:49:00", "2016-03-31T17:38:00",
"2016-03-31T15:31:00", "2016-03-31T17:42:00", "2016-03-31T18:29:00"
), Disposition = c("REP", "GOA", "GOA", "GOA", "GOA", "REP",
"GOA"), Address = c("100 Block Of Chilton Av", "2300 Block Of Market St",
"2300 Block Of Market St", "500 Block Of 7th St", "2300 Block Of Market St",
"Beale St/bryant St", "16th St/pond St"), City = c("San Francisco",
"San Francisco", "San Francisco", "San Francisco", "San Francisco",
"San Francisco", "San Francisco"), State = c("CA", "CA", "CA",
"CA", "CA", "CA", "CA"), AgencyId = c("1", "1", "1", "1", "1",
"1", "1"), Range = c(NA, NA, NA, NA, NA, NA, NA), AddressType = c("Premise Address",
"Premise Address", "Premise Address", "Premise Address", "Premise Address",
"Intersection", "Intersection")), row.names = c("1", "2", "3",
"4", "21", "5", "6"), class = "data.frame")

CodePudding user response:

With library(dplyr) loaded, you can do your_data %>% add_count(across(everything())) to add a count of rows grouped by every column.

Demo:

mtcars[c(1, 1, 2, 3, 2, 3, 3), ] %>% 
  add_count(across(everything()))
#    mpg cyl disp  hp drat    wt  qsec vs am gear carb n
# 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 2
# 2 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 2
# 3 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 2
# 4 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 3
# 5 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 2
# 6 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 3
# 7 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 3
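For completeness, the same count can be built in base R without dplyr. This is a sketch not taken from either answer: it pastes each row into a single key string (assuming no column value contains the separator) and uses ave() to attach the group size:

```r
# base R sketch: count identical rows without dplyr
df <- mtcars[c(1, 1, 2, 3, 2, 3, 3), ]

# collapse each row into one string; "\r" is an arbitrary separator
# assumed not to occur in the data
key <- do.call(paste, c(df, sep = "\r"))

# ave() with FUN = length gives, for each row, the size of its group
df$n <- ave(seq_along(key), key, FUN = length)
```

The resulting n column matches the dup/n columns produced by the dplyr approaches above.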