I am looking to count the number of occurrences of select string values per row in a dataframe. Ideally, this would be completed using the dplyr
package.
Here is a dataframe similar to the one I am working with:
df <- tibble(
d1 = c('b', 'a', 'a', 'a', 'a', 'a', 'a', 'a'),
d2 = c('a', 'a', 'b', 'a', 'a', 'a', 'a', 'a'),
d3 = c('a', 'a', 'a', 'c', 'a', 'b', 'a', 'a'),
d4 = c('a', 'a', 'a', 'a', 'a', 'a', 'a', 'd'),
d5 = c('a', 'c', 'a', 'a', 'a', 'a', 'a', 'a'),
d6 = c('a', 'a', 'a', 'b', 'a', 'a', 'd', 'a'),
d7 = c('a', 'a', 'a', 'a', 'a', 'a', 'a', 'a'),
d8 = c('a', 'a', 'a', 'a', 'a', 'a', 'a', 'a')
)
I would like to be able to define the columns that I am counting over:
cols <- c('d2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8')
I would also like to define the strings I am searching for:
bcde <- c('b', 'c', 'd', 'e')
Thus far, I have been able to identify that one of the strings in bcde
is present in one of cols d2:d8
, per row, using the following code:
df <- df %>%
mutate(
d9 = case_when(
if_any(all_of(cols), ~ . %in% bcde) ~ 1,
TRUE ~ 0)
)
Which produces:
d1 d2 d3 d4 d5 d6 d7 d8 d9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 b a a a a a a a 0
2 a a a a c a a a 1
3 a b a a a a a a 1
4 a a c a a b a a 1
5 a a a a a a a a 0
6 a a b a a a a a 1
7 a a a a a d a a 1
8 a a a d a a a a 1
However, I would like to produce:
d1 d2 d3 d4 d5 d6 d7 d8 d9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 b a a a a a a a 0
2 a a a a c a a a 1
3 a b a a a a a a 1
4 a a c a a b a a 2
5 a a a a a a a a 0
6 a a b a a a a a 1
7 a a a a a d a a 1
8 a a a d a a a a 1
(i.e. count the number of occurrences if a string in bcde
in cols d2:d8
, rather than merely check for their presence).
I am unsure how to proceed by incorporating a counting function whilst looping through the dataframe.
Thanks!
CodePudding user response:
You can use across
with rowSums
-
library(dplyr)
df %>% mutate(d9 = rowSums(across(all_of(cols), `%in%`, bcde)))
# d1 d2 d3 d4 d5 d6 d7 d8 d9
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#1 b a a a a a a a 0
#2 a a a a c a a a 1
#3 a b a a a a a a 1
#4 a a c a a b a a 2
#5 a a a a a a a a 0
#6 a a b a a a a a 1
#7 a a a a a d a a 1
#8 a a a d a a a a 1
This can also be written in base R -
df$d9 <- rowSums(sapply(df[cols], `%in%`, bcde))
CodePudding user response:
This could be also done in base R with axis wise apply
, on axis
2
:
df$d9 <- rowSums(apply(df[cols], 2, `%in%`, bcde))
And now:
df
Gives:
d1 d2 d3 d4 d5 d6 d7 d8 d9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 b a a a a a a a 0
2 a a a a c a a a 1
3 a b a a a a a a 1
4 a a c a a b a a 2
5 a a a a a a a a 0
6 a a b a a a a a 1
7 a a a a a d a a 1
8 a a a d a a a a 1
CodePudding user response:
Here is a dplyr-esque solution:
library(tidyverse)
df <- tibble(
d1 = c('b', 'a', 'a', 'a', 'a', 'a', 'a', 'a'),
d2 = c('a', 'a', 'b', 'a', 'a', 'a', 'a', 'a'),
d3 = c('a', 'a', 'a', 'c', 'a', 'b', 'a', 'a'),
d4 = c('a', 'a', 'a', 'a', 'a', 'a', 'a', 'd'),
d5 = c('a', 'c', 'a', 'a', 'a', 'a', 'a', 'a'),
d6 = c('a', 'a', 'a', 'b', 'a', 'a', 'd', 'a'),
d7 = c('a', 'a', 'a', 'a', 'a', 'a', 'a', 'a'),
d8 = c('a', 'a', 'a', 'a', 'a', 'a', 'a', 'a')
)
cols <- c('d2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8')
bcde <- c('b', 'c', 'd', 'e')
t(df) %>%
as.data.frame() -> dft
dft[cols, ] %>%
map_dbl(~. %in% bcde %>% sum) %>%
{mutate(df, d9 = .)}
#> # A tibble: 8 × 9
#> d1 d2 d3 d4 d5 d6 d7 d8 d9
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 b a a a a a a a 0
#> 2 a a a a c a a a 1
#> 3 a b a a a a a a 1
#> 4 a a c a a b a a 2
#> 5 a a a a a a a a 0
#> 6 a a b a a a a a 1
#> 7 a a a a a d a a 1
#> 8 a a a d a a a a 1
Created on 2021-09-29 by the reprex package (v2.0.1)