Count all duplicated values in R

I have the following vector called x:

x <- c(1, 1, 4, 5, 4, 6, 1, 1)
x
#> [1] 1 1 4 5 4 6 1 1

I would like to count every element whose value occurs more than once. In this case, the elements 1, 1, 1, 1, 4, 4 are duplicates, which gives a total of 6 duplicated values. Here are some attempts:

x <- c(1, 1, 4, 5, 4, 6, 1, 1)
# Wrong outputs
sum(duplicated(x))
#> [1] 4
sum(table(x)-1)
#> [1] 4
# Returns the number of distinct values that have duplicates (here 1 and 4)
nrow(data.frame(table(x))[data.frame(table(x))$Freq > 1,])
#> [1] 2

Created on 2022-12-08 with reprex v2.0.2

So I was wondering if anyone knows how to calculate all the duplicates instead of counting the number of values that have duplicates?

CodePudding user response：

Other options:

sum(Filter(\(z) z > 1, table(x)))
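sum(setdiff(table(x), 1L))  # caution: setdiff() also drops repeated counts, so this undercounts when two values occur equally often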
sum(x %in% x[duplicated(x)])

The last is clearly the fastest; akrun's approach (duplicated() applied in both directions) is a close second:

bench::mark(
  sum(Filter(\(z) z > 1, table(x))),
  sum(setdiff(table(x), 1L)),
  sum(x %in% x[duplicated(x)]),
  sum(table(x)[names(table(x)) %in% x[duplicated(x)]]),
  sum(duplicated(x)|duplicated(x, fromLast = TRUE))
)
# # A tibble: 5 x 13
#   expression                                                min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result    memory              time                gc                   
#   <bch:expr>                                           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>    <list>              <list>              <list>               
# 1 sum(Filter(function(z) z > 1, table(x)))                 58us   67.5us    14335.    5.35KB     6.62  6499     3    453.4ms <int [1]> <Rprofmem [16 x 3]> <bench_tm [6,502]>  <tibble [6,502 x 3]> 
# 2 sum(setdiff(table(x), 1L))                             51.6us   60.9us    16046.        0B     6.56  7338     3    457.3ms <int [1]> <Rprofmem [0 x 3]>  <bench_tm [7,341]>  <tibble [7,341 x 3]> 
# 3 sum(x %in% x[duplicated(x)])                            2.8us    3.2us   294065.        0B     0    10000     0       34ms <int [1]> <Rprofmem [0 x 3]>  <bench_tm [10,000]> <tibble [10,000 x 3]>
# 4 sum(table(x)[names(table(x)) %in% x[duplicated(x)]])  102.1us  123.4us     7957.        0B     4.26  3737     2    469.6ms <int [1]> <Rprofmem [0 x 3]>  <bench_tm [3,739]>  <tibble [3,739 x 3]> 
# 5 sum(duplicated(x) | duplicated(x, fromLast = TRUE))     4.3us    4.9us   194347.        0B    19.4   9999     1     51.4ms <int [1]> <Rprofmem [0 x 3]>  <bench_tm [10,000]> <tibble [10,000 x 3]>

(Disclaimer: profiling code with data this small is really a fool's errand ... but I was curious.)
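Speed aside, it may be worth checking that the options agree on an input other than x. The sketch below uses a made-up vector y (an assumption, not from the question) in which two values share the same count. It writes function(z) instead of \(z) only so that it also runs on R versions before 4.1.

```r
# Hypothetical test vector: 1 and 2 each occur twice, so 4 elements are duplicated
y <- c(1, 1, 2, 2, 3)

# Keep only the counts greater than 1, then add them up
res_filter <- sum(Filter(function(z) z > 1, table(y)))

# Count the elements whose value appears among the repeats
res_in <- sum(y %in% y[duplicated(y)])

# Flag repeats scanning forwards and backwards, then combine
res_both <- sum(duplicated(y) | duplicated(y, fromLast = TRUE))

# setdiff() treats its input as a set: the two counts of 2 collapse into one
res_setdiff <- sum(setdiff(table(y), 1L))

c(filter = res_filter, in_op = res_in, both = res_both, setdiff = res_setdiff)
#>  filter   in_op    both setdiff
#>       4       4       4       2
```

On the original x the counts (4, 2, 1, 1) happen to have no ties above 1, which is why the setdiff() version returns 6 there.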

CodePudding user response：

We can use duplicated() twice, once scanning forward and once in reverse (fromLast = TRUE), so that every occurrence of a repeated value is covered, including the first one:

sum(duplicated(x)|duplicated(x, fromLast = TRUE))
#> [1] 6
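To see why the two passes are needed, it can help to print each logical mask separately. This is only an illustrative sketch; the names fwd and bwd are made up here.

```r
x <- c(1, 1, 4, 5, 4, 6, 1, 1)

# Forward pass: TRUE for every occurrence after the first of each value
fwd <- duplicated(x)
fwd
#> [1] FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE

# Backward pass: TRUE for every occurrence before the last of each value
bwd <- duplicated(x, fromLast = TRUE)
bwd
#> [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE

# An element is part of a duplicated group if either pass flags it
x[fwd | bwd]
#> [1] 1 1 4 4 1 1
sum(fwd | bwd)
#> [1] 6
```

Each pass alone flags only 4 elements, because it always leaves one occurrence per value unflagged; the union picks up that remaining occurrence.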

CodePudding user response：

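An alternative way to calculate it: count the repeat occurrences flagged by duplicated() (1, 1, 1, 4), then add the number of distinct values that have duplicates (1, 4), since each of those contributes one unflagged first occurrence.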

sum(duplicated(x), length(unique(x[duplicated(x)])))
#> [1] 6
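Broken into steps, the arithmetic behind that one-liner looks roughly like this (the intermediate variable names are only for illustration):

```r
x <- c(1, 1, 4, 5, 4, 6, 1, 1)

# Occurrences after the first one of each value, in order of appearance
repeats <- x[duplicated(x)]
repeats
#> [1] 1 4 1 1

n_repeats <- length(repeats)          # 4 repeat occurrences
n_values  <- length(unique(repeats))  # 2 distinct values have repeats (1 and 4)

# Each distinct repeated value has exactly one first occurrence that
# duplicated() did not flag, so adding n_values restores them
n_repeats + n_values
#> [1] 6
```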

CodePudding user response：

While writing this question I found an option using table(): keep the entries whose names correspond to duplicated values and sum their counts, like this:

x <- c(1, 1, 4, 5, 4, 6, 1, 1)
sum(table(x)[names(table(x)) %in% x[duplicated(x)]])
#> [1] 6

I assume there is a better option that does not use the table() function.
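One table-free possibility is to wrap the %in% idiom from the other answer in a small helper. This is a minimal sketch; the function name count_all_duplicated is hypothetical and does not come from any package.

```r
# Hypothetical helper: counts every element whose value occurs more than once
count_all_duplicated <- function(x) {
  # x[duplicated(x)] holds the values that repeat; %in% then marks
  # every occurrence of those values, including the first
  sum(x %in% x[duplicated(x)])
}

count_all_duplicated(c(1, 1, 4, 5, 4, 6, 1, 1))
#> [1] 6
count_all_duplicated(c("a", "b", "a"))
#> [1] 2
count_all_duplicated(integer(0))
#> [1] 0
```

Note that duplicated() and %in% both treat NA values as equal to each other, so repeated NAs would be counted as duplicates as well.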