Some examples of iterative functions to find joint occurrences across a dataset's columns


I have a very complex dataset and I am struggling, in R, to find the joint occurrence of values across four main columns. To explain this better, here is an example with the airquality dataset where, for simplicity, I show two cases:

# install.packages('hablar')  # run once if hablar is not installed
library(readxl)
library(tidyverse)
library(data.table)
library(hablar)

#1st example 
airquality %>% 
  select(Wind, Temp) %>% 
  find_duplicates(Wind, Temp)

#2nd example 
airquality %>% 
  select(Wind, Month) %>% 
  find_duplicates(Wind, Month)

And so on. To avoid writing a separate chunk of code for each further pair of columns, could you suggest an iterative way of getting the same results (for loops, map() functions, apply() functions)? Could you please show some examples?

Just to clarify, the output I am looking for is more or less a table like the one below:

# A tibble: 31 x 2
    Wind  Temp
   <dbl> <int>
 1   6.9    74
 2  11.5    68
 3  14.9    81
 4   7.4    76
 5   8      82
 6  11.5    79
 7  14.9    77
 8  10.3    76
 9   6.3    77
10  14.9    77
# ... with 21 more rows

But I would like to repeat the same operation for many pairs of columns of interest.

P.S. I know there are alternatives to the find_duplicates() function from the hablar package, so if you have any further suggestions, feel free to use them in your solution. It would be very much appreciated.

Thank you so much for your attention.

CodePudding user response:

The code below returns a list of your expected data frames, one per pair of columns:

mycolumns <- c("Wind", "Month", "Ozone", "Solar.R")

# all pairwise combinations of the columns of interest (one pair per element)
columnpairs <- as.data.frame(combn(mycolumns, 2))

result <- lapply(columnpairs, function(x) 
  airquality %>% 
    select(all_of(x)) %>% 
    find_duplicates()
  )
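
If you want the list elements labelled by the pair of columns they cover instead of V1, V2, ..., one option (a small sketch, not part of the original answer) is to build the names from columnpairs:

# optional: name each list element after its pair of columns, e.g. "Wind_Month"
names(result) <- vapply(columnpairs, paste, character(1), collapse = "_")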

Regular dplyr runs it about 37% faster:

result_dplyr <- lapply(columnpairs, function(x) {
  airquality %>% 
    select(all_of(x)) %>% 
    # keep only rows whose combination of values occurs more than once
    group_by(across(all_of(x))) %>% 
    filter(n() > 1)
})
result_dplyr
$V1
# A tibble: 105 x 2
# Groups:   Wind, Month [40]
    Wind Month
   <dbl> <int>
 1   7.4     5
 2   8       5
 3  11.5     5
 4  14.9     5
 5   8.6     5
 6   8.6     5
 7   9.7     5
 8  11.5     5
 9  12       5
10  11.5     5
# ... with 95 more rows

$V2
# A tibble: 31 x 2
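
Since the question also asks about map(), the same idea can be written with purrr (loaded as part of the tidyverse); this is just a sketch equivalent to the lapply() version above:

# purrr equivalent of the lapply() call above
result_map <- purrr::map(columnpairs, ~ airquality %>% 
                           select(all_of(.x)) %>% 
                           group_by(across(all_of(.x))) %>% 
                           filter(n() > 1))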

Benchmarking:

> microbenchmark::microbenchmark(
    hablar = lapply(columnpairs, function(x) airquality %>% select(all_of(x)) %>% find_duplicates()),
    dplyr = lapply(columnpairs, function(x) {airquality %>% select(all_of(x)) %>% group_by(across(all_of(x))) %>% filter(n() > 1) })
  )
Unit: milliseconds
   expr      min        lq     mean   median       uq      max neval
 hablar 123.6340 127.66205 132.8422 130.6805 133.7052 210.2259   100
  dplyr  75.8762  81.19705  82.5891  82.2986  83.6467  98.0426   100
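
For completeness, since the question also mentions for loops, a minimal base-style loop over the same columnpairs (result_loop is just an illustrative name) would be:

# pre-allocate a list and fill it pair by pair
result_loop <- vector("list", ncol(columnpairs))
names(result_loop) <- names(columnpairs)
for (i in seq_along(columnpairs)) {
  cols <- columnpairs[[i]]
  result_loop[[i]] <- airquality %>% 
    select(all_of(cols)) %>% 
    group_by(across(all_of(cols))) %>% 
    filter(n() > 1)
}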