Count the number of occurrences of multiple letters in a variable within a dataframe?

Time:10-15

Say I want to count the number of "a"s and "p"s in the word "apple". I can do:

library(stringr)
sum(str_count("apple", c("a", "p")))

But when I apply the same logic to count the number of "a"s and "p"s across several different words in a variable in a dataframe, it doesn't work, e.g.:

library(dplyr)
dat <- tibble(id = 1:4, word = c("apple", "banana", "pear", "pineapple"))
dat <- dat %>% mutate(num_ap = sum(str_count(word, c("a", "p"))))

The variable "num_ap" should read c(3, 3, 2, 4), but instead it reads c(5, 5, 5, 5).

Does anyone know why this isn't working for me?

Thanks!

CodePudding user response:

In cases like this it helps to backtrack the issue.

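str_count(dat$word, c("a", "p")) by itself returns [1] 1 0 1 3. The pattern vector is recycled along the words, so patterns and words are paired element by element: "a" is counted in "apple" (1), "p" in "banana" (0), "a" in "pear" (1), and "p" in "pineapple" (3). It does not count both letters in every word. Taking the sum of that vector with sum(str_count(dat$word, c("a", "p"))) gives [1] 5. Because mutate() evaluates the expression once over the whole column rather than row by row, that single value is assigned to every row, which is consistent with your results.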
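
The recycling becomes visible when each letter is counted separately. A minimal sketch, assuming stringr is installed; the names words, paired, per_letter and totals are only illustrative:

```r
library(stringr)

words <- c("apple", "banana", "pear", "pineapple")

# Patterns are recycled and paired with the words element by element:
# "a" in apple, "p" in banana, "a" in pear, "p" in pineapple.
paired <- str_count(words, c("a", "p"))

# Counting one letter at a time gives a matrix with one row per word
# and one column per letter.
per_letter <- sapply(c("a", "p"), function(letter) str_count(words, letter))

# Summing across the columns yields the per-word totals.
totals <- rowSums(per_letter)
```

Here paired holds the recycled counts described above, while totals holds the per-word counts the question was after.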

To fix this, use rowwise() (from the dplyr package), which makes mutate() work on each row individually. Each row then holds a single word, so both patterns are applied to that word. Modifying your code to incorporate rowwise() solves the problem:

dat <- dat %>% rowwise() %>% mutate(num_ap = sum(str_count(word, c("a", "p"))))
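
One caveat: a tibble that has passed through rowwise() stays row-wise for later steps. A sketch of the same fix with dplyr's ungroup() appended, assuming dplyr and stringr are installed (the name res is illustrative):

```r
library(dplyr)
library(stringr)

dat <- tibble(id = 1:4, word = c("apple", "banana", "pear", "pineapple"))

res <- dat %>%
  rowwise() %>%                                           # evaluate mutate() once per row
  mutate(num_ap = sum(str_count(word, c("a", "p")))) %>%
  ungroup()                                               # drop the row-wise grouping again
```

Without the final ungroup(), later summaries such as sum(num_ap) inside mutate() would also be computed per row.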

CodePudding user response:

Use sapply() to apply the transformation to each element of dat$word:

library(stringr)
dat <- data.frame(id = 1:4, word = c("apple", "banana", "pear", "pineapple"))
dat$num_ap <- sapply(dat$word, function(x) sum(str_count(x, c("a", "p"))))

dat
#>   id      word num_ap
#> 1  1     apple      3
#> 2  2    banana      3
#> 3  3      pear      2
#> 4  4 pineapple      4

Created on 2021-10-14 by the reprex package (v2.0.1)
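
If the set of letters may change, the same idea can be wrapped in a small helper. This is only a sketch: count_letters is a hypothetical name, not a stringr function, and vapply() is used in place of sapply() so that the result is checked to be exactly one integer per word.

```r
library(stringr)

# Hypothetical helper: total occurrences of any of `letters_to_count`
# in each element of `words`. Each letter is passed to str_count() as a
# pattern, so it is interpreted as a regular expression.
count_letters <- function(words, letters_to_count) {
  vapply(
    words,
    function(w) sum(str_count(w, letters_to_count)),
    FUN.VALUE = integer(1),
    USE.NAMES = FALSE
  )
}

num_ap <- count_letters(c("apple", "banana", "pear", "pineapple"), c("a", "p"))
```

With the data frame from the answer, the call would be dat$num_ap <- count_letters(dat$word, c("a", "p")).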

CodePudding user response:

Two solutions (both without sum):

with rowwise():

library(dplyr)
library(stringr)
dat %>%
  rowwise() %>%
  mutate(num_ap = str_count(word, "a|p"))
  id      word num_ap
1  1     apple      3
2  2    banana      3
3  3      pear      2
4  4 pineapple      4

with lengths and str_extract_all:

library(dplyr)
library(stringr)
dat %>%
  mutate(num_ap = lengths(str_extract_all(word, "a|p")))
  id      word num_ap
1  1     apple      3
2  2    banana      3
3  3      pear      2
4  4 pineapple      4
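
Since str_count() is vectorised over its string argument, a single regular expression should not need rowwise() at all. A sketch using the character class "[ap]" (equivalent here to "a|p"), assuming dplyr and stringr are installed; res is an illustrative name:

```r
library(dplyr)
library(stringr)

dat <- tibble(id = 1:4, word = c("apple", "banana", "pear", "pineapple"))

# One pattern, many strings: str_count() returns one count per word.
res <- dat %>%
  mutate(num_ap = str_count(word, "[ap]"))
```

The recycling problem from the question only arises when the pattern argument itself has more than one element.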

CodePudding user response:

Using base R

dat$num_ap <-  nchar(gsub("[^ap]", "", dat$word))

-output

> dat
  id      word num_ap
1  1     apple      3
2  2    banana      3
3  3      pear      2
4  4 pineapple      4
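
The one-liner works in two steps that can be inspected separately. A sketch in base R only, with illustrative variable names:

```r
words <- c("apple", "banana", "pear", "pineapple")

# Step 1: delete every character that is not "a" or "p".
kept <- gsub("[^ap]", "", words)

# Step 2: the number of remaining characters is the count of a's and p's.
num_ap <- nchar(kept)
```

After step 1, kept contains only the matching letters of each word (for example "app" for "apple"), so its length per element is the desired count.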

data

dat <- structure(list(id = 1:4, word = c("apple", "banana", "pear", 
"pineapple")), class = "data.frame", row.names = c(NA, -4L))