How to check exact matches of tabulations in R?-CodePudding

Data

I have created this fake dataset for my example:

#### Library ####
library(tidyverse)

#### Create Five Random Binomial Distributions ####
x1 <- round(rbinom(n=1000,
             size=1,
             prob=.5))
x2 <- round(rbinom(n=1000,
                   size=1,
                   prob=.5))
x3 <- round(rbinom(n=1000,
                   size=1,
                   prob=.5))
x4 <- round(rbinom(n=1000,
                   size=1,
                   prob=.5))
x5 <- round(rbinom(n=1000,
                   size=1,
                   prob=.5))

#### Merge Into Tibble ####
df <- data.frame(x1,x2,x3,x4,x5)
tib <- as_tibble(df)
tib

Problem

Gathering the data and tabulating them after is fairly straightforward and easy for getting counts of all the variables:

tib %>% 
  gather() %>% 
  table() # looks like 8,9,15 are always the same

However, in a case where there are 100 variables in a dataset, this can be difficult to read. Additionally, I'm looking to see if there are exact matches in tabulations. For example, if X1 and X2 both have counts of n=0 and n=1 being the exact same:

   0  1
x1 40 1000
x2 40 1000

...I would like a way to flag these exact matches of counts without scrolling through miles of tabulations. Is there a way of achieving this?

CodePudding user response：

You could group the cross tabulation by the counts of 0 and 1, and flag thoes groups with more than one rows.

cross_tab <- tib %>% 
  pivot_longer(everything()) %>%
  pivot_wider(names_from = value, values_from = value, values_fn = length) %>%
  group_by(`0`, `1`) %>%
  mutate(flag =  (n() > 1)) %>%
  ungroup()

# # A tibble: 5 × 4
#   name    `0`   `1`  flag
#   <chr> <int> <int> <int>
# 1 x1      489   511     1
# 2 x2      490   510     0
# 3 x3      491   509     0
# 4 x4      495   505     0
# 5 x5      489   511     1

cross_tab %>%
  filter(flag == 1)

# # A tibble: 2 × 4
#   name    `0`   `1`  flag
#   <chr> <int> <int> <int>
# 1 x1      489   511     1
# 2 x5      489   511     1

CodePudding user response：

A base R solution with tib as the input, using combn:

nm <- combn(colnames(tib), m = 2, FUN = paste, collapse = "-")
comb <- combn(colnames(tib), m = 2, FUN = function(i) {
  identical(unname(table(tib[, i][1])), unname(table(tib[, i][2])))
}, simplify = TRUE)
setNames(comb, nm)

#x1-x2 x1-x3 x1-x4 x1-x5 x2-x3 x2-x4 x2-x5 x3-x4 x3-x5 x4-x5 
# TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

And to get only the TRUE values:

v <- setNames(comb, nm)
v[v]
#x1-x2 
# TRUE

CodePudding user response：

Suppose you had the following table, which contains two rows with identical frequencies:

tab
#>       0    1
#> x1   40 1000
#> x2   40 1000
#> x3  100  800
#> x4  120  900

Then you could do

tab %>%
  as.data.frame() %>%
  setNames(c('var', 'value', 'n')) %>%
  group_by(value, n) %>%
  summarize(vars = paste(var, collapse = ', '), .groups = 'drop')
#> # A tibble: 6 x 3
#>   value     n vars  
#>   <fct> <dbl> <chr> 
#> 1 0        40 x1, x2
#> 2 0       100 x3    
#> 3 0       120 x4    
#> 4 1       800 x3    
#> 5 1       900 x4    
#> 6 1      1000 x1, x2

^{Created on 2022-09-29 with reprex v2.0.2}

Data

tab <- structure(c(40, 40, 100, 120, 1000, 1000, 800, 900), .Dim = c(4L, 2L),
          .Dimnames = list(c("x1", "x2", "x3", "x4"), c("0", "1")), 
          class = "table")

CodePudding user response：

Despite the already accepted answer, here's another option that builds on my answer here

# Form the summary table
counts <- tib %>% summarise(across(starts_with("x"), sum)) 
# Collate all combinations of pairs of columns
as_tibble(t(combn(names(tib), 2))) %>% 
  rowwise() %>% 
  group_map(
    function(.x, .y) {
      tibble(
        Col1=.x$V1, 
        Col2=.x$V2, 
        Val1=counts[[.x$V1]], 
        Val2=counts[[.x$V2]], 
        Match=Val1==Val2
      )
    }
  ) %>% 
  bind_rows()
# A tibble: 10 × 5
   Col1  Col2   Val1  Val2 Match
   <chr> <chr> <dbl> <dbl> <lgl>
 1 x1    x2      493   528 FALSE
 2 x1    x3      493   472 FALSE
 3 x1    x4      493   516 FALSE
 4 x1    x5      493   481 FALSE
 5 x2    x3      528   472 FALSE
 6 x2    x4      528   516 FALSE
 7 x2    x5      528   481 FALSE
 8 x3    x4      472   516 FALSE
 9 x3    x5      472   481 FALSE
10 x4    x5      516   481 FALSE