Home > Software design >  Display a table of strings and their variations per row (R)
Display a table of strings and their variations per row (R)

Time:10-03

For a large database, I would like to find a solution where I could predefine the strings to be searched and then get a table that would contain the frequency of these strings and their possible variations per row.

strings <- c("dog", "cat", "mouse")

var1 <- c("black dog", "white dog", "angry dog", "dogs and cats are nice", "dog")
var2 <- c("white cat", "black cat", "tiny cat", NA, "cow")
var3 <- c("little mouse", "big mouse", NA, NA, "mouse")
data <- data.frame(var1, var2, var3)

The result should look like this while I am looking for dog, cat and mouse:

dog&cat 4
mouse 3

CodePudding user response:

We may try

v1 <- do.call(paste, data)
stack(setNames(lapply(c( "\\bdog.*\\bcat|\\bcat.*\\bdog", "mouse"), 
    \(pat) sum(grepl(pat, v1))), c("dog&cat", "mouse")))[2:1]
      ind values
1 dog&cat      4
2   mouse      3

Or if we need all the combinations

lst1 <- lapply(c(strings, combn(strings, 2, FUN = \(x) 
   sprintf("\\b%1$s.*\\b%2$s|\\b%2$s.*\\b%1$s", x[1], x[2]))), 
    \(pat) sum(grepl(pat, v1)))
names(lst1) <- c(strings, combn(strings, 2, FUN = paste, collapse = "&"))
stack(lst1)[2:1]
        ind values
1       dog      5
2       cat      4
3     mouse      3
4   dog&cat      4
5 dog&mouse      3
6 cat&mouse      2

For more combinations, it may be better to use Reduce with individually applying grepl

lst1 <- lapply(1:3, \(n) {
   vals <- colSums(combn(strings, n, 
  FUN = \(pats) Reduce(`&`, lapply(pats, \(pat) grepl(pat, v1)))))
   nms <- combn(strings, n, FUN = paste, collapse = "&")
   setNames(vals, nms)
   })
stack(unlist(lst1))[2:1]
            ind values
1           dog      5
2           cat      4
3         mouse      3
4       dog&cat      4
5     dog&mouse      3
6     cat&mouse      2
7 dog&cat&mouse      2

Or with tidyverse

library(dplyr)
library(stringr)
library(tidyr)
data %>% 
  unite(var, everything(), na.rm = TRUE, sep = " ") %>% 
  summarise(`dog&cat` = sum(str_detect(var,
   "\\bdog.*\\bcat|\\bcat.*\\bdog")),
     mouse = sum(str_detect(var, 'mouse'))) %>%   
  pivot_longer(everything())

-output

 # A tibble: 2 × 2
  name    value
  <chr>   <int>
1 dog&cat     4
2 mouse       3
  • Related