I have data that looks like
df <- data.frame(A = c("a, a, a, b, b, c, c", "a, a, b, b, b, b, c", "a, a, b, b"), B = c(3, 5, 8))
I want to find the most common word (words are separated by ", ") for each observation of variable A.
All approaches I have found only extract the most common word in the entire column, such as
table(unlist(strsplit(df$A,", "))) %>% which.max() %>% names()
and I get
wrong_result <- data.frame(A = c("a, a, a, b, b, c, c", "a, a, b, b, b, b, c", "a, a, b, b"), B = c(3, 5, 8), C = c("b", "b", "b"))
If two words are equally frequent they should both be extracted. The result should look like
result <- data.frame(A = c("a, a, a, b, b, c, c", "a, a, b, b, b, b, c", "a, a, b, b"), B = c(3, 5, 8), C = c("a", "b", "a, b"))
CodePudding user response:
You can do:
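library(dplyr)
library(stringr)
library(purrr)  # provides map_chr(); it is not attached by dplyr or stringr
df %>%
  # split each string into words, then keep the word(s) with the highest count
  mutate(maxi = map_chr(str_split(A, pattern = ", "),
                        ~ toString(names(which(table(.x) == max(table(.x)))))))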
#                    A B maxi
#1 a, a, a, b, b, c, c 3    a
#2 a, a, b, b, b, b, c 5    b
#3          a, a, b, b 8 a, b
CodePudding user response:
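A base R solution:
sapply(strsplit(df$A, ", "), \(x) {
  # count each word within this row
  tab <- table(x)
  # keep every word that reaches the maximum count (ties included)
  toString(names(tab[tab == max(tab)]))
})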
# [1] "a" "b" "a, b"
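The call above returns a plain character vector. To attach it to the data frame as the column C that the question asks for, one option is the minimal sketch below. The helper name most_common and its sep argument are made up for illustration and are not part of the answer above; vapply() is swapped in for sapply() only so that the result is guaranteed to be one string per row.

```r
# Same data as in the question.
df <- data.frame(
  A = c("a, a, a, b, b, c, c", "a, a, b, b, b, b, c", "a, a, b, b"),
  B = c(3, 5, 8),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent word(s) of ONE comma-separated string,
# with ties joined by ", ".
most_common <- function(s, sep = ", ") {
  tab <- table(strsplit(s, sep, fixed = TRUE)[[1]])
  toString(names(tab)[tab == max(tab)])
}

# One call per row; character(1) enforces a single string per element.
df$C <- vapply(df$A, most_common, character(1), USE.NAMES = FALSE)
df$C
```

Because the counting happens inside the helper, each row gets its own table, which is the difference from the whole-column attempt in the question.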
CodePudding user response:
Here's another solution, using the tidyverse:
library(tidyverse)
df %>%
  # separate `A` into rows:
  separate_rows(A) %>%
  # for each combination of `B` and `A`...
  group_by(B, A) %>%
  # ... count the number of occurrences:
  summarise(N = n(), .groups = "drop_last") %>%
  # within each `B`, keep the maximum value(s):
  filter(N == max(N)) %>%
  # collapse the strings back together:
  summarise(
    C = str_c(A, collapse = ", ")
  ) %>%
  # select the new column `C`:
  select(C) %>%
  # bind this column back to the original `df`:
  bind_cols(., df)
# A tibble: 3 × 3
  C     A                       B
  <chr> <chr>               <dbl>
1 a     a, a, a, b, b, c, c     3
2 b     a, a, b, b, b, b, c     5
3 a, b  a, a, b, b              8
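Note that grouping by B only works here because every row has a different value of B; rows that share a B value would be pooled together. A possible variant, sketched under that assumption, adds an explicit row identifier first. The data frame df2 and the column id are hypothetical and introduced only for this illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical data in which B is NOT unique across rows.
df2 <- data.frame(A = c("a, a, b", "b, b, a", "a, b"), B = c(1, 1, 2))

res <- df2 %>%
  mutate(id = row_number()) %>%       # hypothetical per-row identifier
  separate_rows(A, sep = ", ") %>%    # one word per row
  count(id, A, name = "N") %>%        # frequency of each word within a row
  group_by(id) %>%
  filter(N == max(N)) %>%             # keep the top word(s), ties included
  summarise(C = toString(A), .groups = "drop")

# res has one row per id, in the original row order, so it can be bound back.
out <- bind_cols(df2, res["C"])
out
```

Passing sep = ", " to separate_rows() is also a deliberate choice in this sketch: the default separator splits on any run of non-alphanumeric characters, which would break entries that contain spaces.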
CodePudding user response:
Here is a base R solution with the native pipe operator introduced in R 4.1.0.
df <- data.frame(A = c("a, a, a, b, b, c, c", "a, a, b, b, b, b, c", "a, a, b, b"), B = c(3, 5, 8))
strsplit(df$A, ", ") |>
  lapply(table) |>
  lapply(\(x) names(x[x == max(x)])) |>
  sapply(toString)
#> [1] "a" "b" "a, b"
Created on 2022-07-23 by the reprex package (v2.0.1)