I have a table with twenty columns and thousands of rows.
As a toy example, say I have this table:
ColumnA ColumnB
Testing This
1231 1231
I want to count how many times each single character appears in the whole dataset.
So in our toy example we would have
character count
T 3
e 1
s 2
i 2
n 1
g 1
h 1
1 4
2 2
3 2
I was thinking of using some kind of string manipulation, but I'm not sure how to do this.
CodePudding user response:
You can use strsplit and table:
df <- data.frame(ColumnA=c('Testing', '1231'),
ColumnB=c('This', '1231'))
table(tolower(unlist(sapply(df, strsplit, ''))))
# 1 2 3 e g h i n s t
# 4 2 2 1 1 1 2 1 2 3
This does not distinguish between lowercase and uppercase letters: all are converted to lowercase. If you want to keep that distinction, remove the tolower() call.
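To illustrate the case-sensitive variant, here is a minimal sketch on the same toy data with tolower() dropped. It uses lapply instead of sapply (my choice, to get a plain list back), and the variable name counts is only for illustration:

```r
# Same toy data frame as above; stringsAsFactors = FALSE keeps the columns
# as character vectors on R versions older than 4.0 as well
df <- data.frame(ColumnA = c('Testing', '1231'),
                 ColumnB = c('This', '1231'),
                 stringsAsFactors = FALSE)

# Split every cell into single characters and count them, keeping case
counts <- table(unlist(lapply(df, strsplit, '')))

counts[['T']]  # 2 (uppercase only)
counts[['t']]  # 1 (lowercase only)
```

With case preserved, the 3 occurrences of "t" from the lowercase version are split into 2 for "T" and 1 for "t".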
CodePudding user response:
Does this work?
data.frame(table(strsplit(toupper(paste0(apply(df, 2, paste0, collapse = ''), collapse = '')), split = '')))
Var1 Freq
1 1 4
2 2 2
3 3 2
4 E 1
5 G 1
6 H 1
7 I 2
8 N 1
9 S 2
10 T 3
CodePudding user response:
This is similar to the other two answers (by Karthik and Robert), but it
- does not use the apply family of functions, and
- uses the pipe for better readability.
Base R (the native pipe |> and the \(d) lambda shorthand require R 4.1 or later):
df |>
as.matrix() |>
strsplit('') |>
unlist() |>
tolower() |>
table() |>
stack() |>
(\(d) setNames(d[2:1], c('character', 'count')))()
# character count
#1 1 4
#2 2 2
#3 3 2
#4 e 1
#5 g 1
#6 h 1
#7 i 2
#8 n 1
#9 s 2
#10 t 3
And since you tagged tidyverse, here is the same answer written with tidyverse functions:
library(tidyverse)
df %>%
as.matrix() %>%
str_split('') %>%
flatten_chr() %>%
tolower() %>%
table() %>%
enframe(name = "character", value = "count") %>%
mutate(count = as.numeric(count))
CodePudding user response:
Here's a tidyverse solution:
library(tidyverse)
df %>%
pivot_longer(everything()) %>%
separate_rows(value, sep = "(?<!^)(?!$)") %>%
group_by(char = tolower(value)) %>%
summarise(N = n())
# A tibble: 10 × 2
char N
<chr> <int>
1 1 4
2 2 2
3 3 2
4 e 1
5 g 1
6 h 1
7 i 2
8 n 1
9 s 2
10 t 3
CodePudding user response:
You can use tidytext:
library(tidytext)
library(tidyr)
library(dplyr)
df %>%
pivot_longer(everything()) %>%
unnest_tokens(value, value, token = "characters") %>%
count(value)
Output:
# A tibble: 10 × 2
value n
<chr> <int>
1 1 4
2 2 2
3 3 2
4 e 1
5 g 1
6 h 1
7 i 2
8 n 1
9 s 2
10 t 3