Find most common word(s) in character string value

I have data that looks like

df <- data.frame(A = c("a, a, a, b, b, c, c", "a, a, b, b, b, b, c", "a, a, b, b"), B = c(3, 5, 8))

I want to find the most common word in each observation of variable A, where the words are separated by ", ".

All approaches I have found only extract the most common word in the entire column, such as

table(unlist(strsplit(df$A,", "))) %>% which.max() %>% names()

and I get

wrong_result <- data.frame(A = c("a, a, a, b, b, c, c", "a, a, b, b, b, b, c", "a, a, b, b"), B = c(3, 5, 8), C = c("b", "b", "b"))

If two words are equally frequent they should both be extracted. The result should look like

result <- data.frame(A = c("a, a, a, b, b, c, c", "a, a, b, b, b, b, c", "a, a, b, b"), B = c(3, 5, 8), C = c("a", "b", "a, b"))

CodePudding user response：

You can do:

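library(dplyr)
library(stringr)
library(purrr)  # provides map_chr()
df %>% 
  # map_chr() returns a character column rather than a list column
  mutate(maxi = map_chr(str_split(A, pattern = ", "), 
                        ~ toString(names(which(table(.x) == max(table(.x)))))))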

#                    A B maxi
#1 a, a, a, b, b, c, c 3    a
#2 a, a, b, b, b, b, c 5    b
#3          a, a, b, b 8 a, b

CodePudding user response：

A base solution:

sapply(strsplit(df$A,", "), \(x) {
  tab <- table(x)
  toString(names(tab[tab == max(tab)]))
})

# [1] "a"    "b"    "a, b"
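
If you need this more than once, the same idea can be wrapped in a small function. This is only a sketch: the name most_common_words and its sep argument are made up for illustration, and it assumes you want surrounding whitespace trimmed and NA returned for missing or empty input.

```r
# Hypothetical helper (not part of any package): returns the most frequent
# word(s) in each element of a character vector, ties joined with ", ".
most_common_words <- function(x, sep = ",") {
  vapply(strsplit(x, sep, fixed = TRUE), function(words) {
    words <- trimws(words)                            # drop spaces around each word
    words <- words[!is.na(words) & nzchar(words)]     # ignore NA and empty pieces
    if (length(words) == 0) return(NA_character_)     # nothing to count
    tab <- table(words)
    toString(names(tab)[tab == max(tab)])             # keep every word tied for the maximum
  }, character(1))
}

df <- data.frame(A = c("a, a, a, b, b, c, c", "a, a, b, b, b, b, c", "a, a, b, b"),
                 B = c(3, 5, 8))
df$C <- most_common_words(df$A)
```

Using vapply() instead of sapply() guarantees a character vector of the same length as the input, so the result can be assigned straight to a new column.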

CodePudding user response：

Here's another solution, in tidyverse:

library(tidyverse)
df %>%
  # separate `A` into rows:
  separate_rows(A) %>%
  # for each combination of `B` and `A`...
  group_by(B, A) %>%
  # ... count the number of occurrence:
  summarise(N = n()) %>%
  # filter the maximum value(s):
  filter(N == max(N)) %>%
  # collapse the strings back together:
  summarise( 
            # use ", " so the output matches the format of the desired result
            C = str_c(A, collapse = ', ')
            ) %>%
  # select the new column `C`:
  select(C) %>%
  # bind this column back to the original `df`
  # (this relies on `df` already being sorted by `B`):
  bind_cols(., df)
# A tibble: 3 × 3
#   C     A                       B
#   <chr> <chr>               <dbl>
# 1 a     a, a, a, b, b, c, c     3
# 2 b     a, a, b, b, b, b, c     5
# 3 a, b  a, a, b, b              8
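
One caveat with the pipeline above: grouping by B and binding the columns back assumes that the values of B are unique and already sorted. A variant that does not depend on that could use an explicit row id and a join. This is a sketch only; the id and word columns and the counts object are names I made up for illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(A = c("a, a, a, b, b, c, c", "a, a, b, b, b, b, c", "a, a, b, b"),
                 B = c(3, 5, 8))

counts <- df %>%
  # hypothetical helper column: one id per original row
  mutate(id = row_number()) %>%
  select(id, word = A) %>%
  # one row per word, splitting on a comma plus optional whitespace
  separate_rows(word, sep = ",\\s*") %>%
  # frequency of each word within each original row
  count(id, word) %>%
  group_by(id) %>%
  # keep every word tied for the maximum count
  filter(n == max(n)) %>%
  summarise(C = toString(word))

# attach the result by id rather than by row position
result <- df %>%
  mutate(id = row_number()) %>%
  left_join(counts, by = "id") %>%
  select(-id)
```

Because the result is joined on id, it stays aligned with the original rows even if B contains duplicates or is not in ascending order.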

CodePudding user response：

Here is a base R solution with the native pipe operator `|>` introduced in R 4.1.0.

df <- data.frame(A = c("a, a, a, b, b, c, c", "a, a, b, b, b, b, c", "a, a, b, b"), B = c(3, 5, 8))

strsplit(df$A,", ") |>
  lapply(table) |>
  lapply(\(x) names(x[x == max(x)])) |>
  sapply(toString)
#> [1] "a"    "b"    "a, b"

Created on 2022-07-23 by the reprex package (v2.0.1)