Data
I have created this fake dataset for my example:
#### Library ####
library(tidyverse)
#### Create Five Random Binomial Distributions ####
x1 <- round(rbinom(n=1000,
size=1,
prob=.5))
x2 <- round(rbinom(n=1000,
size=1,
prob=.5))
x3 <- round(rbinom(n=1000,
size=1,
prob=.5))
x4 <- round(rbinom(n=1000,
size=1,
prob=.5))
x5 <- round(rbinom(n=1000,
size=1,
prob=.5))
#### Merge Into Tibble ####
df <- data.frame(x1,x2,x3,x4,x5)
tib <- as_tibble(df)
tib
Problem
Gathering the data and tabulating them after is fairly straightforward and easy for getting counts of all the variables:
tib %>%
gather() %>%
table() # looks like 8,9,15 are always the same
However, in a case where there are 100 variables in a dataset, this can be difficult to read. Additionally, I'm looking to see if there are exact matches in tabulations. For example, if X1 and X2 both have counts of n=0 and n=1 being the exact same:
0 1
x1 40 1000
x2 40 1000
...I would like a way to flag these exact matches of counts without scrolling through miles of tabulations. Is there a way of achieving this?
CodePudding user response:
You could group the cross tabulation by the counts of 0 and 1, and flag thoes groups with more than one rows.
cross_tab <- tib %>%
pivot_longer(everything()) %>%
pivot_wider(names_from = value, values_from = value, values_fn = length) %>%
group_by(`0`, `1`) %>%
mutate(flag = (n() > 1)) %>%
ungroup()
# # A tibble: 5 × 4
# name `0` `1` flag
# <chr> <int> <int> <int>
# 1 x1 489 511 1
# 2 x2 490 510 0
# 3 x3 491 509 0
# 4 x4 495 505 0
# 5 x5 489 511 1
cross_tab %>%
filter(flag == 1)
# # A tibble: 2 × 4
# name `0` `1` flag
# <chr> <int> <int> <int>
# 1 x1 489 511 1
# 2 x5 489 511 1
CodePudding user response:
A base R solution with tib
as the input, using combn
:
nm <- combn(colnames(tib), m = 2, FUN = paste, collapse = "-")
comb <- combn(colnames(tib), m = 2, FUN = function(i) {
identical(unname(table(tib[, i][1])), unname(table(tib[, i][2])))
}, simplify = TRUE)
setNames(comb, nm)
#x1-x2 x1-x3 x1-x4 x1-x5 x2-x3 x2-x4 x2-x5 x3-x4 x3-x5 x4-x5
# TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
And to get only the TRUE values:
v <- setNames(comb, nm)
v[v]
#x1-x2
# TRUE
CodePudding user response:
Suppose you had the following table, which contains two rows with identical frequencies:
tab
#> 0 1
#> x1 40 1000
#> x2 40 1000
#> x3 100 800
#> x4 120 900
Then you could do
tab %>%
as.data.frame() %>%
setNames(c('var', 'value', 'n')) %>%
group_by(value, n) %>%
summarize(vars = paste(var, collapse = ', '), .groups = 'drop')
#> # A tibble: 6 x 3
#> value n vars
#> <fct> <dbl> <chr>
#> 1 0 40 x1, x2
#> 2 0 100 x3
#> 3 0 120 x4
#> 4 1 800 x3
#> 5 1 900 x4
#> 6 1 1000 x1, x2
Created on 2022-09-29 with reprex v2.0.2
Data
tab <- structure(c(40, 40, 100, 120, 1000, 1000, 800, 900), .Dim = c(4L, 2L),
.Dimnames = list(c("x1", "x2", "x3", "x4"), c("0", "1")),
class = "table")
CodePudding user response:
Despite the already accepted answer, here's another option that builds on my answer here
# Form the summary table
counts <- tib %>% summarise(across(starts_with("x"), sum))
# Collate all combinations of pairs of columns
as_tibble(t(combn(names(tib), 2))) %>%
rowwise() %>%
group_map(
function(.x, .y) {
tibble(
Col1=.x$V1,
Col2=.x$V2,
Val1=counts[[.x$V1]],
Val2=counts[[.x$V2]],
Match=Val1==Val2
)
}
) %>%
bind_rows()
# A tibble: 10 × 5
Col1 Col2 Val1 Val2 Match
<chr> <chr> <dbl> <dbl> <lgl>
1 x1 x2 493 528 FALSE
2 x1 x3 493 472 FALSE
3 x1 x4 493 516 FALSE
4 x1 x5 493 481 FALSE
5 x2 x3 528 472 FALSE
6 x2 x4 528 516 FALSE
7 x2 x5 528 481 FALSE
8 x3 x4 472 516 FALSE
9 x3 x5 472 481 FALSE
10 x4 x5 516 481 FALSE