I'm trying to filter a dataframe in r, based on the fact that the row entries are the same on the first two out of three columns. What I have done is following (where df is a 3 column dataframe)
not_duplicate <- df[!(duplicated(dplyr::select(df, col1, col2))), ]
This takes quite some time. Is there a way to speed this up/better way to do this? Thanks!
CodePudding user response:
Try this data.table
option -
library(data.table)
setDT(df)
unique(df, by = c('col1', 'col2'))
CodePudding user response:
Does this make it faster?
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
iris %>% distinct(across(1:2), .keep_all = TRUE) %>% as_tibble()
#> # A tibble: 117 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 107 more rows
Created on 2021-10-05 by the reprex package (v2.0.1)
Speed comparisons
Using unique
and data.table
is the fastest way (so far):
library(microbenchmark)
library(tidyverse)
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#>
#> between, first, last
#> The following object is masked from 'package:purrr':
#>
#> transpose
df <- nycflights13::flights %>% rename(col1 = year, col2 = month)
microbenchmark(
question = {
df[!(duplicated(dplyr::select(df, col1, col2))), ]
},
Weed = {
df %>%
group_by(col1, col2) %>%
mutate(nrep = n()) %>%
filter(nrep == 1) %>%
ungroup()
},
danlooo = {
df %>% distinct(across(1:2), .keep_all = TRUE)
},
ronak = {
setDT(df)
unique(df, by = c('col1', 'col2'))
},
times = 10
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> question 4.059082 4.131558 79.471833 4.211303 6.153801 453.642332 10
#> Weed 17.702280 19.095162 21.956365 20.565925 26.074310 29.571600 10
#> danlooo 8.079582 8.133334 9.912199 8.686196 9.827751 15.959280 10
#> ronak 1.262237 1.386761 1.648035 1.454734 1.572879 3.419493 10
#> cld
#> a
#> a
#> a
#> a
Created on 2021-10-05 by the reprex package (v2.0.1)
CodePudding user response:
You can try
df %>%
group_by(col1, col2) %>%
mutate(nrep = n()) %>%
filter(nrep == 1) %>%
ungroup()