R: table frequencies of letters in string based on Alphabet

Time:06-14

I need to compute letter frequencies for a large list of words. For each position in the word (first, second, ...), I need to count how many times each letter (a-z) appears across the list, and then tabulate the counts by word position.

For example, if my word list is:

words <- c("swims", "seems", "gills", "draws", "which", "water")

then the result table should look like this:

letter  first position  second position  third position  fourth position  fifth position
a       0               1                1               0                0
b       0               0                0               0                0
c       0               0                0               1                0
d       1               0                0               0                0
e       0               1                1               1                0
f       0               0                0               0                0
...     (continued until z)

All words have the same length (5).

What I have so far is:

library(dplyr)  # provides %>% and mutate()

alphabet <- letters[1:26]

words.df <- data.frame("Words" = words)

# Extract the letter at each position; note the column is "Words" (capital W)
words.df <- words.df %>% mutate("First_place" = substr(Words, 1, 1))
words.df <- words.df %>% mutate("Second_place" = substr(Words, 2, 2))
words.df <- words.df %>% mutate("Third_place" = substr(Words, 3, 3))
words.df <- words.df %>% mutate("Fourth_place" = substr(Words, 4, 4))
words.df <- words.df %>% mutate("Fifth_place" = substr(Words, 5, 5))



x1 <- words.df$First_place
x1 <- table(factor(x1,alphabet))


x2 <- words.df$Second_place
x2 <- table(factor(x2,alphabet))


x3 <- words.df$Third_place
x3 <- table(factor(x3,alphabet))


x4 <- words.df$Fourth_place
x4 <- table(factor(x4,alphabet))


x5 <- words.df$Fifth_place
x5 <- table(factor(x5,alphabet))

My code is not efficient and gives a separate table for each letter position instead of one combined table. Any help will be appreciated.

CodePudding user response:

In base R, use table():

table(let = unlist(strsplit(words,'')),pos = sequence(nchar(words)))

   pos
let 1 2 3 4 5
  a 0 1 1 0 0
  c 0 0 0 1 0
  d 1 0 0 0 0
  e 0 1 1 1 0
  g 1 0 0 0 0
  h 0 1 0 0 1
  i 0 1 2 0 0
  l 0 0 1 1 0
  m 0 0 0 2 0
  r 0 1 0 0 1
  s 2 0 0 0 4
  t 0 0 1 0 0
  w 2 1 0 1 0
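
To see why this works, it can help to look at the two vectors that table() receives. A minimal sketch using the six words from the question (the names chars, pos and tab are only for illustration):

```r
words <- c("swims", "seems", "gills", "draws", "which", "water")

# strsplit() returns one character vector per word; unlist() flattens them
chars <- unlist(strsplit(words, ""))

# For each word, the positions 1..nchar(word), concatenated in the same order
pos <- sequence(nchar(words))

head(chars, 5)  # "s" "w" "i" "m" "s"  (the letters of "swims")
head(pos, 5)    # 1 2 3 4 5            (their positions)

# chars and pos are parallel, so table() cross-tabulates letter by position
tab <- table(let = chars, pos = pos)
```

Because both vectors are built word by word in the same order, element i of chars always sits at position pos[i] of its word.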

Note that if you need a row for every letter from a to z, including letters that never occur, use:

table(factor(unlist(strsplit(words,'')), letters), sequence(nchar(words)))
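
A quick way to check the effect of fixing the factor levels, again sketched with the question's six words (full is just an illustrative name):

```r
words <- c("swims", "seems", "gills", "draws", "which", "water")

full <- table(factor(unlist(strsplit(words, "")), levels = letters),
              sequence(nchar(words)))

dim(full)      # 26 5: one row per letter a-z, one column per position
full["b", ]    # a letter that never occurs still gets a row of zeros
colSums(full)  # every column sums to the number of words (6 here)
```

One thing to keep in mind: any character that is not in letters (an uppercase letter, for instance) becomes NA in factor() and is dropped by table() with its default settings, so applying tolower(words) first may be worthwhile.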

Also, to get a data frame, you could do:

d <- table(factor(unlist(strsplit(words,'')), letters), sequence(nchar(words)))
cbind(letters = rownames(d), as.data.frame.matrix(d))
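
If this is needed in several places, the steps above could be wrapped in a small helper. This is only a sketch: the function name letter_position_counts and the pos1, pos2, ... column names are made up for illustration, and lowercasing the input is an assumption about what is wanted.

```r
# Hypothetical helper: counts of each letter a-z at each word position
letter_position_counts <- function(words) {
  words <- tolower(words)  # assumption: case should be ignored
  tab <- table(factor(unlist(strsplit(words, "")), levels = letters),
               sequence(nchar(words)))
  out <- as.data.frame.matrix(tab)           # one column per position
  names(out) <- paste0("pos", names(out))    # "1" -> "pos1", and so on
  out <- cbind(letter = rownames(out), out)  # letters as a regular column
  rownames(out) <- NULL
  out
}

words <- c("swims", "seems", "gills", "draws", "which", "water")
res <- letter_position_counts(words)
head(res)
```

The result has 26 rows and a letter column followed by one count column per position.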

CodePudding user response:

Here is a tidyverse solution using dplyr, purrr, and tidyr:

library(dplyr)  # count(), %>%
library(purrr)  # map_dfr()
library(tidyr)  # pivot_longer(), pivot_wider()

strsplit(words.df$Words, "") %>%
  map_dfr(~setNames(.x, seq_along(.x))) %>%
  pivot_longer(everything(),
               values_drop_na = TRUE,
               names_to = "pos",
               values_to = "letter") %>%
  count(pos, letter) %>%
  pivot_wider(names_from = pos,
              names_glue = "pos{pos}",
              id_cols = letter,
              values_from = n,
              values_fill = 0L)

Output (from a longer word list than the six-word example in the question)


   letter pos1 pos2 pos3 pos4 pos5 pos6 pos7 pos8 pos9 pos10 pos11
1       a   65  127   88   38   28   17   14    5    3     0     0
2       b   58    4    7    9    2    4    2    0    1     0     0
3       c   83   14   45   37   20   19    8    3    2     0     0
4       C    2    0    0    0    0    0    0    0    0     0     0
5       d   43    8   33   47   21   22    9    3    1     1     0
6       e   45  156   81  132  114   69   48   23   14     2     2
7       f   54   11   18   10    5    2    1    0    0     0     0
8       g   23    7   27   21   15    8    7    1    0     0     0
9       h   38   56    6   28   21   10    3    3    1     1     0
10      i   25  106   51   58   38   28    8    4    1     0     0
11      j    6    0    2    2    0    0    0    0    0     0     0
12      k    9    1    6   22   12    0    0    0    0     0     0
13      l   45   41   54   54   36    9    7    6    0     2     0
14      m   45    8   31   19    8    8    4    2    0     0     0
15      n   23   42   75   53   34   41   16   16    4     2     0
16      o   28  167   76   41   38    9   11    2    1     0     0
17      p   72   20   34   30    8    3    1    1    1     0     0
18      q    7    2    1    0    0    0    0    0    0     0     0
19      r   46   74   92   59   56   45   12    9    1     1     0
20      s  119    8   67   35   31   22   18    4    1     0     0
21      t   65   30   73   83   57   42   31    9    6     3     1
22      u   12   66   39   36   20    7    7    2    0     0     0
23      v    8    7   20   12    5    5    1    0    0     0     0
24      w   53    8   13   10    2    3    0    1    0     0     0
25      y    6    4   16   15   17   15   10    5    6     1     1
26      x    0   12    5    0    0    0    0    0    0     0     0
27      z    0    0    1    0    0    0    1    1    0     0     0
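
The table above has eleven position columns and a separate row for capital C, so the list behind it contains words of different lengths and mixed case. As a sketch, the base R approach from the first answer copes with unequal lengths as well, since sequence(nchar(words)) restarts at 1 for every word. The three words below are made up for illustration, and lowercasing them is an assumption:

```r
mixed <- c("Cat", "horse", "ox")  # hypothetical words of unequal length

tab <- table(factor(unlist(strsplit(tolower(mixed), "")), levels = letters),
             sequence(nchar(mixed)))

ncol(tab)      # 5: as many columns as the longest word has letters
colSums(tab)   # 3 3 2 1 1: how many words reach each position
tab["o", ]     # 1 1 0 0 0: "ox" has o first, "horse" has o second
```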