In RStudio, I have a data frame (df) of character strings, with one column per group. There are about 600 strings in each column, and I am not sure whether certain strings are repeated across all the columns/groups or in just 2 or 3 of them. Is there a way to make a new df containing only the repeated strings, together with the columns/groups in which they repeat?
For example, my df looks like this:
Group1 Group2 Group3 Group4 Group5
AB FG SA KD CD
CD ZX AB ER ZX
ED QW OI SA AB
GD AS ZX QW KD
I'm not sure what the final df would look like, but I want to be able to identify which strings are repeated in which groups, and then make a figure to display that information. I hope that makes sense. Alternatively, how can I pick out which strings are repeated in two columns, then three, then four columns, or in all 5 columns? Thank you.
CodePudding user response:
library(tidyverse)
data <- tribble(
~Group1, ~Group2, ~Group3, ~Group4, ~Group5,
"AB", "FG", "SA", "KD", "CD",
"CD", "ZX", "AB", "ER", "ZX",
"ED", "QW", "OI", "SA", "AB",
"GD", "AS", "ZX", "QW", "KD"
)
repeated_values <-
data %>%
pivot_longer(everything()) %>%
count(value) %>%
filter(n >= 2) %>%
pull(value)
repeated_values
#> [1] "AB" "CD" "KD" "QW" "SA" "ZX"
# in which rows are which repeated characters?
repeated_data <-
data %>%
mutate(row_id = row_number()) %>%
pivot_longer(-row_id) %>%
filter(value %in% repeated_values)
repeated_data
#> # A tibble: 14 x 3
#> row_id name value
#> <int> <chr> <chr>
#> 1 1 Group1 AB
#> 2 1 Group3 SA
#> 3 1 Group4 KD
#> 4 1 Group5 CD
#> 5 2 Group1 CD
#> 6 2 Group2 ZX
#> 7 2 Group3 AB
#> 8 2 Group5 ZX
#> 9 3 Group2 QW
#> 10 3 Group4 SA
#> 11 3 Group5 AB
#> 12 4 Group3 ZX
#> 13 4 Group4 QW
#> 14 4 Group5 KD
# in how many distinct rows does each repeated value appear?
# (this counts rows, not groups: ZX is in 3 groups but only 2 rows)
repeated_data %>%
distinct(row_id, value) %>%
count(value)
#> # A tibble: 6 x 2
#> value n
#> <chr> <int>
#> 1 AB 3
#> 2 CD 2
#> 3 KD 2
#> 4 QW 2
#> 5 SA 2
#> 6 ZX 2
Created on 2021-11-11 by the reprex package (v2.0.1)
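The counts above are per row, while the question asks about groups (columns). A minimal sketch of a per-group summary on the same example data follows; the names `groups_per_value`, `n_groups` and `groups` are made up for illustration, and it assumes a value should be counted at most once per group.

```r
library(dplyr)
library(tidyr)

data <- tibble::tribble(
  ~Group1, ~Group2, ~Group3, ~Group4, ~Group5,
  "AB", "FG", "SA", "KD", "CD",
  "CD", "ZX", "AB", "ER", "ZX",
  "ED", "QW", "OI", "SA", "AB",
  "GD", "AS", "ZX", "QW", "KD"
)

groups_per_value <- data %>%
  # one row per cell, keeping the column name as `group`
  pivot_longer(everything(), names_to = "group", values_to = "value") %>%
  # count each group at most once per value
  distinct(group, value) %>%
  group_by(value) %>%
  summarise(
    n_groups = n(),                                # number of groups containing the value
    groups = paste(sort(group), collapse = ", "),  # which groups those are
    .groups = "drop"
  ) %>%
  # keep values found in at least two groups
  filter(n_groups >= 2) %>%
  arrange(desc(n_groups), value)

groups_per_value
```

To pick out the values shared by exactly three groups, something like `filter(groups_per_value, n_groups == 3)` should work; change the number for two, four or all five groups.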
CodePudding user response:
Here is an example of how to print out the Groups:
Data:
dat <- structure(list(Group1 = c("AB", "CD", "ED", "GD"), Group2 = c("FG",
"ZX", "QW", "AS"), Group3 = c("SA", "AB", "OI", "ZX"), Group4 = c("KD",
"ER", "SA", "QW"), Group5 = c("CD", "ZX", "AB", "KD")), class = "data.frame", row.names = c(NA,
-4L))
dat
Group1 Group2 Group3 Group4 Group5
1 AB FG SA KD CD
2 CD ZX AB ER ZX
3 ED QW OI SA AB
4 GD AS ZX QW KD
- Get the number of repeats:
ta <- table(as.matrix(dat))
# all character strings
ta
AB AS CD ED ER FG GD KD OI QW SA ZX
3 1 2 1 1 1 1 2 1 2 2 3
# only repeated
ta[ta > 1]
AB CD KD QW SA ZX
3 2 2 2 2 3
- Populate a list of character vectors to get the groups:
sapply( names(table(as.matrix(dat))[table(as.matrix(dat)) > 1]),
function(x) colnames(dat[grep(paste0("\\b",x,"\\b"), dat)]) )
$AB
[1] "Group1" "Group3" "Group5"
$CD
[1] "Group1" "Group5"
$KD
[1] "Group4" "Group5"
$QW
[1] "Group2" "Group4"
$SA
[1] "Group3" "Group4"
$ZX
[1] "Group2" "Group3" "Group5"
- Show all character strings, including those that occur only once:
sapply( names(table(as.matrix(dat))),
function(x) colnames(dat[grep(paste0("\\b",x,"\\b"), dat)]) )
$AB
[1] "Group1" "Group3" "Group5"
$AS
[1] "Group2"
$CD
[1] "Group1" "Group5"
$ED
[1] "Group1"
$ER
[1] "Group4"
$FG
[1] "Group2"
$GD
[1] "Group1"
$KD
[1] "Group4" "Group5"
$OI
[1] "Group3"
$QW
[1] "Group2" "Group4"
$SA
[1] "Group3" "Group4"
$ZX
[1] "Group2" "Group3" "Group5"
- Optionally, return the matching columns themselves instead of just their names:
sapply( names(table(as.matrix(dat))[table(as.matrix(dat)) > 1]),
function(x) dat[grep(paste0("\\b",x,"\\b"), dat)] )
$AB
Group1 Group3 Group5
1 AB SA CD
2 CD AB ZX
3 ED OI AB
4 GD ZX KD
... etc
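The `grep()` calls above work because `grep()` coerces each column of the data frame to a single string, which can misfire if the values contain regex metacharacters. As a possible alternative, here is a sketch that uses exact matching with `%in%` to build a presence matrix, then splits the values by the number of groups they occur in and draws a simple figure. The names `presence`, `repeated`, `groups_by_value` and `by_count` are made up for illustration. Note that this counts distinct groups, so a value duplicated only within one column is not treated as repeated, unlike `ta > 1`.

```r
dat <- data.frame(
  Group1 = c("AB", "CD", "ED", "GD"),
  Group2 = c("FG", "ZX", "QW", "AS"),
  Group3 = c("SA", "AB", "OI", "ZX"),
  Group4 = c("KD", "ER", "SA", "QW"),
  Group5 = c("CD", "ZX", "AB", "KD"),
  stringsAsFactors = FALSE
)

# Logical matrix: one row per distinct value, one column per group
all_values <- sort(unique(unlist(dat, use.names = FALSE)))
presence <- sapply(dat, function(col) all_values %in% col)
rownames(presence) <- all_values

# Keep only values present in at least two groups
repeated <- presence[rowSums(presence) >= 2, , drop = FALSE]

# Named list: value -> groups it occurs in
groups_by_value <- lapply(
  setNames(rownames(repeated), rownames(repeated)),
  function(v) colnames(repeated)[repeated[v, ]]
)

# Values split by the number of groups they occur in
by_count <- split(rownames(repeated), rowSums(repeated))
by_count

# Presence/absence figure: groups on the x axis, values on the y axis
image(
  x = seq_len(ncol(repeated)), y = seq_len(nrow(repeated)),
  z = t(repeated * 1), col = c("white", "steelblue"),
  axes = FALSE, xlab = "", ylab = ""
)
axis(1, at = seq_len(ncol(repeated)), labels = colnames(repeated))
axis(2, at = seq_len(nrow(repeated)), labels = rownames(repeated), las = 1)
```

With the example data, `by_count[["3"]]` holds the values found in three groups and `by_count[["2"]]` those found in two. With around 600 rows per column the figure will have many more rows, so it may be worth plotting only one element of `by_count` at a time.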