Count how many times each character appears in the whole dataset

Time:10-04

I have a table with twenty columns and thousands of rows.

For example, suppose I have this table:

ColumnA   ColumnB
Testing      This
1231         1231

I want to count how many times each single character appears in the whole dataset.

So in our toy example we would have (ignoring case):

character   nº of times
T                3
e                1
s                2
i                2
n                1
g                1
h                1
1                4
2                2
3                2

I was thinking of using some kind of string manipulation, but I'm not sure how to do this.

CodePudding user response:

You can use strsplit and table:

df <- data.frame(ColumnA=c('Testing', '1231'),
                 ColumnB=c('This', '1231'))

table(tolower(unlist(sapply(df, strsplit, ''))))
# 1 2 3 e g h i n s t 
# 4 2 2 1 1 1 2 1 2 3 

This does not distinguish between lowercase and uppercase letters: all are converted to lowercase. To keep that distinction, remove the tolower() call.
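As a minimal sketch of the case-sensitive variant, using the same toy df (the names chars and counts are only illustrative, and as.matrix() is swapped in for sapply() to flatten the columns):

```r
# Same toy data as above
df <- data.frame(ColumnA = c('Testing', '1231'),
                 ColumnB = c('This', '1231'))

# Split every cell into single characters; no tolower(),
# so 'T' and 't' are tallied separately
chars <- unlist(strsplit(as.matrix(df), ''))
counts <- table(chars)

counts[['T']]  # 2 (from "Testing" and "This")
counts[['t']]  # 1 (from "Testing")
```

The order in which table() lists the characters depends on the locale's sort order, so it is safer to look entries up by name than by position.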

CodePudding user response:

Does this work?

data.frame(table(strsplit(toupper(paste0(apply(df, 2, paste0, collapse = ''), collapse = '')), split = '')))
   Var1 Freq
1     1    4
2     2    2
3     3    2
4     E    1
5     G    1
6     H    1
7     I    2
8     N    1
9     S    2
10    T    3

CodePudding user response:

This is similar to the other two answers (by Karthik and Robert), but

  1. it does not use the apply family of functions, and
  2. it uses the pipe for better readability.

Base R (the native pipe |> and the \(d) lambda need R 4.1 or later):

df |> 
  as.matrix() |>
  strsplit('') |>
  unlist() |>
  tolower() |>
  table() |>
  stack() |>
  (\(d) setNames(d[2:1], c('character', 'count')))()

#   character count
#1          1     4
#2          2     2
#3          3     2
#4          e     1
#5          g     1
#6          h     1
#7          i     2
#8          n     1
#9          s     2
#10         t     3

Since you tagged tidyverse, here is the same answer written with tidyverse functions:

library(tidyverse)

df %>%
  as.matrix() %>%
  str_split('') %>%
  flatten_chr() %>%
  tolower() %>%
  table() %>%
  enframe(name = "character", value = "count") %>%
  mutate(count = as.numeric(count))

CodePudding user response:

Here's a tidyverse solution:

library(tidyverse)
df %>%
  pivot_longer(everything()) %>%
  separate_rows(value, sep = "(?<!^)(?!$)") %>%
  group_by(char = tolower(value)) %>%
  summarise(N = n())
# A tibble: 10 × 2
   char      N
   <chr> <int>
 1 1         4
 2 2         2
 3 3         2
 4 e         1
 5 g         1
 6 h         1
 7 i         2
 8 n         1
 9 s         2
10 t         3

CodePudding user response:

You can use tidytext:

library(tidytext)
library(tidyr)
library(dplyr)

df %>%
  pivot_longer(everything()) %>% 
  unnest_tokens(value, value, token = "characters") %>% 
  count(value)

Output:

# A tibble: 10 × 2
   value     n
   <chr> <int>
 1 1         4
 2 2         2
 3 3         2
 4 e         1
 5 g         1
 6 h         1
 7 i         2
 8 n         1
 9 s         2
10 t         3