I have a very complex dataset where I am struggling to find with R, the joint occurrence through four main columns in their values' rows. To explain this better, I am providing the following example with the airquality
dataset, where just for simplicity purposes, I will show two main examples:
library(readxl)
library(tidyverse)
libra
library(data.table)
install.packages('hablar')
library(hablar)
#1st example
airquality %>%
select(Wind, Temp) %>%
find_duplicates(Wind, Temp)
#2nd example
airquality %>%
select(Wind, Month) %>%
find_duplicates(Wind, Month)
And so on. Just to avoid writing separate chunks of code to iterate the same operation for further columns, could you please suggest another iterative mode for the same results (for loops, map() functions, apply() functions). Could you please show as follows some examples?
Just to clarify the output I am looking for seems more a less to a table such as the one provided below:
# A tibble: 31 x 2
Wind Temp
<dbl> <int>
1 6.9 74
2 11.5 68
3 14.9 81
4 7.4 76
5 8 82
6 11.5 79
7 14.9 77
8 10.3 76
9 6.3 77
10 14.9 77
# ... with 21 more rows
But I would like to iterate the same result for many columns of interest
P.s. I know that instead of using function 'find_duplicates from the package ('hablar')
there are different. So you have any further suggestions, feel free to implement this in your solution. It will be very appreciated.
Thank you so much for paying attention
CodePudding user response:
The below will return you a list
of your expected data.frame
s
mycolumns= c("Wind", "Month", "Ozone", "Solar.R")
columnpairs <- as.data.frame(combn(mycolumns, 2))
result <- lapply(columnpairs, function(x)
airquality %>%
select(Wind, Temp) %>%
find_duplicates(Wind, Temp)
)
Regular dplyr
runs it 37% faster:
result_dplyr <- lapply(columnpairs, function(x) {
airquality %>%
select(all_of(x)) %>%
group_by(across(all_of(x))) %>% filter(n() > 1)
}
)
result_dplyr
$V1
# A tibble: 105 x 2
# Groups: Wind, Month [40]
Wind Month
<dbl> <int>
1 7.4 5
2 8 5
3 11.5 5
4 14.9 5
5 8.6 5
6 8.6 5
7 9.7 5
8 11.5 5
9 12 5
10 11.5 5
# ... with 95 more rows
$V2
# A tibble: 31 x 2
Benchmarking:
> microbenchmark::microbenchmark(
hablar = lapply(columnpairs, function(x) airquality %>% select(Wind, Temp) %>% find_duplicates(Wind, Temp)),
dplyr= lapply(columnpairs, function(x) {airquality %>% select(all_of(x)) %>% group_by(across(all_of(x))) %>% filter(n() > 1) })
)
Unit: milliseconds
expr min lq mean median uq max neval
hablar 123.6340 127.66205 132.8422 130.6805 133.7052 210.2259 100
dplyr 75.8762 81.19705 82.5891 82.2986 83.6467 98.0426 100