Count how many times each character appears in the whole dataset

I have a table with twenty columns and thousands of rows.

Just for example purposes, I will say I have this table:

ColumnA   ColumnB
Testing      This
1231         1231

I want to count how many times each single character appears in the whole dataset.

So in our toy example we would have

character   count
T                3
e                1
s                2
i                2
n                1
g                1
h                1
1                4
2                2
3                2

I was thinking of using some kind of string manipulation, but I'm not sure how to do this.

CodePudding user response：

You can use strsplit and table:

df <- data.frame(ColumnA=c('Testing', '1231'),
                 ColumnB=c('This', '1231'))

table(tolower(unlist(sapply(df, strsplit, ''))))
# 1 2 3 e g h i n s t 
# 4 2 2 1 1 1 2 1 2 3

This does not distinguish between lowercase and uppercase letters: everything is converted to lowercase. To keep that distinction, remove the tolower() call.
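
As a minimal sketch of the case-sensitive variant, assuming the same toy df as above (the name case_counts is only illustrative), the counts for 'T' and 't' then come out separately:

```r
# Toy data from the question; stringsAsFactors = FALSE keeps the columns
# as character vectors on R versions before 4.0 as well
df <- data.frame(ColumnA = c('Testing', '1231'),
                 ColumnB = c('This', '1231'),
                 stringsAsFactors = FALSE)

# Same idea as above, but without tolower(), so 'T' and 't' stay separate
case_counts <- table(unlist(lapply(df, strsplit, '')))

case_counts['T']  # uppercase T: once in "Testing", once in "This"
case_counts['t']  # lowercase t: once in "Testing"
```

The order in which table() lists mixed-case letters depends on the locale, so it is safer to look counts up by name than by position.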

CodePudding user response：

Does this work? It converts everything to uppercase before counting:

data.frame(table(strsplit(toupper(paste0(apply(df, 2, paste0, collapse = ''), collapse = '')), split = '')))
   Var1 Freq
1     1    4
2     2    2
3     3    2
4     E    1
5     G    1
6     H    1
7     I    2
8     N    1
9     S    2
10    T    3

CodePudding user response：

This is similar to the other two answers (by Karthik and Robert), but it does not use the apply family of functions and it uses the pipe for better readability.

Base R -

df |> 
  as.matrix() |>
  strsplit('') |>
  unlist() |>
  tolower() |>
  table() |>
  stack() |>
  (\(d) setNames(d[2:1], c('character', 'count')))()

#   character count
#1          1     4
#2          2     2
#3          3     2
#4          e     1
#5          g     1
#6          h     1
#7          i     2
#8          n     1
#9          s     2
#10         t     3
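
If the real table also has numeric or factor columns, the same steps could be wrapped in a small function. This is only a sketch: count_chars() and its ignore_case argument are hypothetical names, not part of any package. It converts each column with as.character() instead of as.matrix(), because as.matrix() on a data frame that mixes numeric and character columns may format the numbers to a common width, and the padding spaces would then be counted too.

```r
# count_chars() is a hypothetical helper name, not part of any package
count_chars <- function(data, ignore_case = TRUE) {
  # Convert each column on its own so numbers and factors become plain text
  cells <- unlist(lapply(data, as.character), use.names = FALSE)
  cells <- cells[!is.na(cells)]          # skip missing cells
  chars <- unlist(strsplit(cells, ''))   # one element per character
  if (ignore_case) chars <- tolower(chars)
  counts <- table(chars)
  out <- data.frame(character = names(counts),
                    count = as.vector(counts),
                    stringsAsFactors = FALSE)
  # Most frequent characters first, ties broken by character
  out <- out[order(-out$count, out$character), ]
  rownames(out) <- NULL
  out
}

df <- data.frame(ColumnA = c('Testing', '1231'),
                 ColumnB = c('This', '1231'),
                 stringsAsFactors = FALSE)

count_chars(df)                       # case-insensitive counts
count_chars(df, ignore_case = FALSE)  # 'T' and 't' counted separately
```

The sketch assumes the data contains at least one non-missing cell; an empty input is not handled.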

Since you tagged tidyverse, here is the same answer written with tidyverse functions.

library(tidyverse)

df %>%
  as.matrix() %>%
  str_split('') %>%
  flatten_chr() %>%
  tolower() %>%
  table() %>%
  enframe(name = "character", value = "count") %>%
  mutate(count = as.numeric(count))

CodePudding user response：

Here's a tidyverse solution:

library(tidyverse)
df %>%
  pivot_longer(everything()) %>%
  separate_rows(value, sep = "(?<!^)(?!$)") %>%
  group_by(char = tolower(value)) %>%
  summarise(N = n())
# A tibble: 10 × 2
   char      N
   <chr> <int>
 1 1         4
 2 2         2
 3 3         2
 4 e         1
 5 g         1
 6 h         1
 7 i         2
 8 n         1
 9 s         2
10 t         3

CodePudding user response：

You can use tidytext:

library(tidytext)
library(tidyr)
library(dplyr)

df %>%
  pivot_longer(everything()) %>% 
  unnest_tokens(value, value, token = "characters") %>% 
  count(value)

Output:

# A tibble: 10 × 2
   value     n
   <chr> <int>
 1 1         4
 2 2         2
 3 3         2
 4 e         1
 5 g         1
 6 h         1
 7 i         2
 8 n         1
 9 s         2
10 t         3