Faster way of selecting sub duplicates in R?


I'm trying to filter a dataframe in R, keeping only rows that are not duplicated on the first two of its three columns. What I have done so far is the following (where df is a 3-column dataframe):

not_duplicate <- df[!(duplicated(dplyr::select(df, col1, col2))), ]

This takes quite some time. Is there a faster or better way to do this? Thanks!

CodePudding user response:

Try this data.table option -

library(data.table)

setDT(df)
unique(df, by = c('col1', 'col2'))
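
For example, on a small toy data frame (column names col1/col2/col3 and the values are invented here purely for illustration), this keeps the first row of each (col1, col2) pair:

library(data.table)

df <- data.frame(
  col1 = c(1, 1, 2, 2, 3),
  col2 = c("a", "a", "b", "c", "c"),
  col3 = 1:5
)

setDT(df)                            # converts df to a data.table by reference
unique(df, by = c('col1', 'col2'))   # keeps the first row of each (col1, col2) pair

Note that setDT() modifies df in place; if you want to keep the original data.frame untouched, as.data.table(df) converts a copy instead.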

CodePudding user response:

Does this make it faster?

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
iris %>% distinct(across(1:2), .keep_all = TRUE) %>% as_tibble()
#> # A tibble: 117 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # … with 107 more rows

Created on 2021-10-05 by the reprex package (v2.0.1)

Speed comparisons

Using unique and data.table is the fastest way (so far):

library(microbenchmark)
library(tidyverse)
library(data.table)
#> 
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, first, last
#> The following object is masked from 'package:purrr':
#> 
#>     transpose

df <- nycflights13::flights %>% rename(col1 = year, col2 = month)

microbenchmark(
  question = {
    df[!(duplicated(dplyr::select(df, col1, col2))), ]
  },
  Weed = {
    df %>% 
      group_by(col1, col2) %>% 
      mutate(nrep = n()) %>% 
      filter(nrep == 1) %>% 
      ungroup() 
  },
  danlooo = {
    df %>% distinct(across(1:2), .keep_all = TRUE)
  },
  ronak = {
    setDT(df)
    unique(df, by = c('col1', 'col2'))
  },
  times = 10
)
#> Unit: milliseconds
#>      expr       min        lq      mean    median        uq        max neval cld
#>  question  4.059082  4.131558 79.471833  4.211303  6.153801 453.642332    10   a
#>      Weed 17.702280 19.095162 21.956365 20.565925 26.074310  29.571600    10   a
#>   danlooo  8.079582  8.133334  9.912199  8.686196  9.827751  15.959280    10   a
#>     ronak  1.262237  1.386761  1.648035  1.454734  1.572879   3.419493    10   a

Created on 2021-10-05 by the reprex package (v2.0.1)
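
One caveat about the comparison: setDT(df) converts by reference, so after the first run of the ronak expression the shared df is a data.table, and every later run of the other expressions operates on a data.table rather than a tibble. A variant that converts a fresh copy on each run would look like the sketch below (it times the conversion as well, so it is not directly comparable to the numbers above):

library(data.table)
library(dplyr)
library(microbenchmark)

df <- nycflights13::flights %>% rename(col1 = year, col2 = month)

microbenchmark(
  ronak = unique(as.data.table(df), by = c('col1', 'col2')),  # converts a copy, leaves df untouched
  times = 10
)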

CodePudding user response:

You can try

df %>% 
  group_by(col1, col2) %>% 
  mutate(nrep = n()) %>% 
  filter(nrep == 1) %>% 
  ungroup() 
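
Note that filter(nrep == 1) keeps only the (col1, col2) pairs that occur exactly once, which is not quite the !duplicated() behaviour in the question (that keeps the first row of every pair). If the question's behaviour is what you need, a grouped slice() is one way to get it (a minimal sketch, column names assumed as in the question):

library(dplyr)

df %>% 
  group_by(col1, col2) %>% 
  slice(1) %>%     # keep the first row of each (col1, col2) pair
  ungroup()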