Home > database >  Count occurrence of string values per row in dataframe in R (dplyr)
Count occurrence of string values per row in dataframe in R (dplyr)

Time:09-30

I am looking to count the number of occurrences of select string values per row in a dataframe. Ideally, this would be completed using the dplyr package.

Here is a dataframe similar to the one I am working with:

df <- tibble(
  d1 = c('b', 'a', 'a', 'a', 'a', 'a', 'a', 'a'),
  d2 = c('a', 'a', 'b', 'a', 'a', 'a', 'a', 'a'),
  d3 = c('a', 'a', 'a', 'c', 'a', 'b', 'a', 'a'),
  d4 = c('a', 'a', 'a', 'a', 'a', 'a', 'a', 'd'),
  d5 = c('a', 'c', 'a', 'a', 'a', 'a', 'a', 'a'),
  d6 = c('a', 'a', 'a', 'b', 'a', 'a', 'd', 'a'),
  d7 = c('a', 'a', 'a', 'a', 'a', 'a', 'a', 'a'),
  d8 = c('a', 'a', 'a', 'a', 'a', 'a', 'a', 'a')
)

I would like to be able to define the columns that I am counting over:

cols <- c('d2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8')

I would also like to define the strings I am searching for:

bcde <- c('b', 'c', 'd', 'e')

Thus far, I have been able to identify that one of the strings in bcde is present in one of cols d2:d8, per row, using the following code:

df <- df %>% 
  mutate(
    d9 = case_when(
      if_any(all_of(cols), ~ . %in% bcde) ~ 1,
      TRUE ~ 0)
  )

Which produces:

  d1    d2    d3    d4    d5    d6    d7    d8       d9
  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 b     a     a     a     a     a     a     a         0
2 a     a     a     a     c     a     a     a         1
3 a     b     a     a     a     a     a     a         1
4 a     a     c     a     a     b     a     a         1
5 a     a     a     a     a     a     a     a         0
6 a     a     b     a     a     a     a     a         1
7 a     a     a     a     a     d     a     a         1
8 a     a     a     d     a     a     a     a         1

However, I would like to produce:

  d1    d2    d3    d4    d5    d6    d7    d8       d9
  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 b     a     a     a     a     a     a     a         0
2 a     a     a     a     c     a     a     a         1
3 a     b     a     a     a     a     a     a         1
4 a     a     c     a     a     b     a     a         2
5 a     a     a     a     a     a     a     a         0
6 a     a     b     a     a     a     a     a         1
7 a     a     a     a     a     d     a     a         1
8 a     a     a     d     a     a     a     a         1

(i.e. count the number of occurrences if a string in bcde in cols d2:d8, rather than merely check for their presence).

I am unsure how to proceed by incorporating a counting function whilst looping through the dataframe.

Thanks!

CodePudding user response:

You can use across with rowSums -

library(dplyr)

df %>% mutate(d9 = rowSums(across(all_of(cols), `%in%`, bcde)))

#  d1    d2    d3    d4    d5    d6    d7    d8       d9
#  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#1 b     a     a     a     a     a     a     a         0
#2 a     a     a     a     c     a     a     a         1
#3 a     b     a     a     a     a     a     a         1
#4 a     a     c     a     a     b     a     a         2
#5 a     a     a     a     a     a     a     a         0
#6 a     a     b     a     a     a     a     a         1
#7 a     a     a     a     a     d     a     a         1
#8 a     a     a     d     a     a     a     a         1

This can also be written in base R -

df$d9 <- rowSums(sapply(df[cols], `%in%`, bcde))

CodePudding user response:

This could be also done in base R with axis wise apply, on axis 2:

df$d9 <- rowSums(apply(df[cols], 2, `%in%`, bcde))

And now:

df

Gives:

  d1    d2    d3    d4    d5    d6    d7    d8       d9
  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 b     a     a     a     a     a     a     a         0
2 a     a     a     a     c     a     a     a         1
3 a     b     a     a     a     a     a     a         1
4 a     a     c     a     a     b     a     a         2
5 a     a     a     a     a     a     a     a         0
6 a     a     b     a     a     a     a     a         1
7 a     a     a     a     a     d     a     a         1
8 a     a     a     d     a     a     a     a         1

CodePudding user response:

Here is a dplyr-esque solution:

library(tidyverse)

df <- tibble(
  d1 = c('b', 'a', 'a', 'a', 'a', 'a', 'a', 'a'),
  d2 = c('a', 'a', 'b', 'a', 'a', 'a', 'a', 'a'),
  d3 = c('a', 'a', 'a', 'c', 'a', 'b', 'a', 'a'),
  d4 = c('a', 'a', 'a', 'a', 'a', 'a', 'a', 'd'),
  d5 = c('a', 'c', 'a', 'a', 'a', 'a', 'a', 'a'),
  d6 = c('a', 'a', 'a', 'b', 'a', 'a', 'd', 'a'),
  d7 = c('a', 'a', 'a', 'a', 'a', 'a', 'a', 'a'),
  d8 = c('a', 'a', 'a', 'a', 'a', 'a', 'a', 'a')
)

cols <- c('d2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8')
bcde <- c('b', 'c', 'd', 'e')

t(df) %>%
 as.data.frame() -> dft

dft[cols, ] %>% 
  map_dbl(~. %in% bcde %>% sum) %>%
  {mutate(df, d9 = .)}
#> # A tibble: 8 × 9
#>   d1    d2    d3    d4    d5    d6    d7    d8       d9
#>   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 b     a     a     a     a     a     a     a         0
#> 2 a     a     a     a     c     a     a     a         1
#> 3 a     b     a     a     a     a     a     a         1
#> 4 a     a     c     a     a     b     a     a         2
#> 5 a     a     a     a     a     a     a     a         0
#> 6 a     a     b     a     a     a     a     a         1
#> 7 a     a     a     a     a     d     a     a         1
#> 8 a     a     a     d     a     a     a     a         1

Created on 2021-09-29 by the reprex package (v2.0.1)

  • Related