Identify repeated characters across columns in R

In RStudio, I have a df of character strings, with different groups in columns. There are about 600 strings in each column, and I am not sure whether certain strings are repeated across all the columns/groups or in just 2 or 3 of them. Is there a way to make a new df containing only the repeated strings, together with the columns/groups in which they repeat?

For example, my df looks like this:

Group1 Group2 Group3 Group4 Group5
AB      FG    SA     KD      CD
CD      ZX    AB     ER      ZX 
ED      QW    OI     SA      AB
GD      AS    ZX     QW      KD

I'm not sure what the final df would look like, but I want to be able to identify which strings are repeated in which groups, and then make a figure to display that information. Alternatively, how can I pick out which strings are repeated in two columns, then three, then four columns, or in all 5 columns? Thank you.

CodePudding user response:

library(tidyverse)

data <- tribble(
  ~Group1, ~Group2, ~Group3, ~Group4, ~Group5,
  "AB", "FG", "SA", "KD", "CD",
  "CD", "ZX", "AB", "ER", "ZX",
  "ED", "QW", "OI", "SA", "AB",
  "GD", "AS", "ZX", "QW", "KD"
)


repeated_values <-
  data %>%
  pivot_longer(everything()) %>%
  group_by(value) %>%
  count() %>%
  filter(n >= 2) %>%
  pull(value)
repeated_values
#> [1] "AB" "CD" "KD" "QW" "SA" "ZX"

# which repeated values appear in which rows (and groups)?
repeated_data <-
  data %>%
  mutate(row_id = row_number()) %>%
  pivot_longer(-row_id) %>%
  filter(value %in% repeated_values)
repeated_data
#> # A tibble: 14 x 3
#>    row_id name   value
#>     <int> <chr>  <chr>
#>  1      1 Group1 AB   
#>  2      1 Group3 SA   
#>  3      1 Group4 KD   
#>  4      1 Group5 CD   
#>  5      2 Group1 CD   
#>  6      2 Group2 ZX   
#>  7      2 Group3 AB   
#>  8      2 Group5 ZX   
#>  9      3 Group2 QW   
#> 10      3 Group4 SA   
#> 11      3 Group5 AB   
#> 12      4 Group3 ZX   
#> 13      4 Group4 QW   
#> 14      4 Group5 KD

# in how many distinct rows does each repeated value appear?
repeated_data %>%
  distinct(row_id, value) %>%
  count(value)
#> # A tibble: 6 x 2
#>   value     n
#>   <chr> <int>
#> 1 AB        3
#> 2 CD        2
#> 3 KD        2
#> 4 QW        2
#> 5 SA        2
#> 6 ZX        2

Created on 2021-11-11 by the reprex package (v2.0.1)
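
The last table counts rows, while the question asks about columns/groups. A minimal sketch of the same pipeline that counts distinct groups per value instead is given here; the column names `group`, `n_groups`, and `groups` are illustrative choices, not part of the original answer:

```r
library(tidyverse)

data <- tribble(
  ~Group1, ~Group2, ~Group3, ~Group4, ~Group5,
  "AB", "FG", "SA", "KD", "CD",
  "CD", "ZX", "AB", "ER", "ZX",
  "ED", "QW", "OI", "SA", "AB",
  "GD", "AS", "ZX", "QW", "KD"
)

groups_per_value <-
  data %>%
  pivot_longer(everything(), names_to = "group", values_to = "value") %>%
  distinct(group, value) %>%   # count each group at most once per value
  group_by(value) %>%
  summarise(
    n_groups = n(),                               # number of groups containing the value
    groups = paste(sort(group), collapse = ", ")  # which groups those are
  ) %>%
  filter(n_groups >= 2) %>%    # keep values found in two or more groups
  arrange(desc(n_groups), value)

groups_per_value
```

Filtering on `n_groups == 5` instead would keep only the values present in every group.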

CodePudding user response:

Here is an example of how to print out the Groups:

Data:

dat <- structure(list(Group1 = c("AB", "CD", "ED", "GD"), Group2 = c("FG", 
"ZX", "QW", "AS"), Group3 = c("SA", "AB", "OI", "ZX"), Group4 = c("KD", 
"ER", "SA", "QW"), Group5 = c("CD", "ZX", "AB", "KD")), class = "data.frame", row.names = c(NA, 
-4L))

dat
  Group1 Group2 Group3 Group4 Group5
1     AB     FG     SA     KD     CD
2     CD     ZX     AB     ER     ZX
3     ED     QW     OI     SA     AB
4     GD     AS     ZX     QW     KD

Get the number of repeats:

ta <- table(as.matrix(dat))

# all character strings
ta
AB AS CD ED ER FG GD KD OI QW SA ZX 
 3  1  2  1  1  1  1  2  1  2  2  3 

# only repeated
ta[ta > 1]
AB CD KD QW SA ZX 
 3  2  2  2  2  3

Populate a list of character vectors to get the groups:

# grep() coerces each column of dat to a single string, so the
# word-boundary pattern matches every column that contains x
sapply(names(ta[ta > 1]),
  function(x) colnames(dat[grep(paste0("\\b", x, "\\b"), dat)]))
$AB
[1] "Group1" "Group3" "Group5"
$CD
[1] "Group1" "Group5"
$KD
[1] "Group4" "Group5"
$QW
[1] "Group2" "Group4"
$SA
[1] "Group3" "Group4"
$ZX
[1] "Group2" "Group3" "Group5"
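
The same lookup can be sketched without regular expressions by comparing the data frame to each value directly. This is an alternative under the assumption that exact string equality is what is wanted. The names `repeated`, `groups_by_value`, and `by_n_groups` are illustrative:

```r
dat <- data.frame(
  Group1 = c("AB", "CD", "ED", "GD"),
  Group2 = c("FG", "ZX", "QW", "AS"),
  Group3 = c("SA", "AB", "OI", "ZX"),
  Group4 = c("KD", "ER", "SA", "QW"),
  Group5 = c("CD", "ZX", "AB", "KD"),
  stringsAsFactors = FALSE
)

ta <- table(as.matrix(dat))
repeated <- names(ta[ta > 1])

# for each repeated value, the names of the columns that contain it
groups_by_value <- lapply(
  setNames(repeated, repeated),
  function(x) names(dat)[colSums(dat == x) > 0]
)

# values arranged by the number of columns they occur in
by_n_groups <- split(names(groups_by_value), lengths(groups_by_value))
by_n_groups
```

`by_n_groups` has one element per column count, which addresses the "two columns, then three" part of the question. A value repeated only within a single column would land under `1`, since `ta` counts occurrences rather than columns.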

To show all character strings, including those that occur only once:

sapply(names(ta),
  function(x) colnames(dat[grep(paste0("\\b", x, "\\b"), dat)]))
$AB
[1] "Group1" "Group3" "Group5"
$AS
[1] "Group2"
$CD
[1] "Group1" "Group5"
$ED
[1] "Group1"
$ER
[1] "Group4"
$FG
[1] "Group2"
$GD
[1] "Group1"
$KD
[1] "Group4" "Group5"
$OI
[1] "Group3"
$QW
[1] "Group2" "Group4"
$SA
[1] "Group3" "Group4"
$ZX
[1] "Group2" "Group3" "Group5"

To return the contents of the matching columns as well:

sapply(names(ta[ta > 1]),
  function(x) dat[grep(paste0("\\b", x, "\\b"), dat)])
$AB
  Group1 Group3 Group5
1     AB     SA     CD
2     CD     AB     ZX
3     ED     OI     AB
4     GD     ZX     KD
... etc
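
For the figure mentioned in the question, one possible starting point is a presence/absence grid of values against groups. The sketch below uses base graphics only. The helper `plot_presence` and the names `all_values` and `presence` are hypothetical, introduced here for illustration:

```r
dat <- data.frame(
  Group1 = c("AB", "CD", "ED", "GD"),
  Group2 = c("FG", "ZX", "QW", "AS"),
  Group3 = c("SA", "AB", "OI", "ZX"),
  Group4 = c("KD", "ER", "SA", "QW"),
  Group5 = c("CD", "ZX", "AB", "KD"),
  stringsAsFactors = FALSE
)

all_values <- sort(unique(unlist(dat)))

# logical matrix: one row per value, one column per group
presence <- sapply(dat, function(col) all_values %in% col)
rownames(presence) <- all_values

# number of groups in which each value occurs
n_groups <- rowSums(presence)

# hypothetical helper: draw the grid, filled cell = value present in group
plot_presence <- function(m) {
  image(
    x = seq_len(ncol(m)), y = seq_len(nrow(m)), z = t(m * 1),
    col = c("white", "steelblue"), axes = FALSE, xlab = "", ylab = ""
  )
  axis(1, at = seq_len(ncol(m)), labels = colnames(m))
  axis(2, at = seq_len(nrow(m)), labels = rownames(m), las = 1)
  box()
}

if (interactive()) plot_presence(presence)
```

With about 600 strings per column, subsetting first (for example `presence[n_groups >= 2, ]`) would probably keep the grid readable.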