Count the frequency of values across row every 20 columns-CodePudding

I have a large (5k rows) dataset that looks like this

ID  Hour intensity00 intensity01 intensity02 ... intensity59
21    1      0           0           1              2
21    2      1           0           0              1
22    0      0           1           1              0

How would I add a column that has the counts of the 0's,1's,2's that are in every 20 columns

So it would split intensity00 - intensity 19 then intensity20 - intensity39 then intensity40 - intensity59 and have the number of 0's 1's and 2's for each 20 column group appended at the end.

I have tried:

solution <- table(unlist(temp3[1,3:22]))

but I cannot figure out how to use that to iterate over each group of 20 columns and every row.

CodePudding user response：

Here we use tidyr::pivot_longer() to create groups by the original column names. Using rounded division of the group size we create the groups. Then we simply dplyr::count() to get the tabulation. Finally we tidyr::pivot_wider() again to put it into the format you were expecting.

I think it's much easier to work with in this format, but if you need it stitched into your original data as you stated in the question, you can join and reorder using the last block of code below.

library(tidyverse)

set.seed(1)
d <- replicate(60, sample(0:2, 3, T)) %>% 
  as.data.frame() %>% 
  setNames(paste0("intensity", formatC(0:59, width = 2, flag = "0"))) %>% 
  mutate(ID = c(21, 21, 22), Hour = c(1, 2, 0), .before = 1)

d %>% 
  pivot_longer(starts_with("intensity"), names_prefix = "intensity", names_to = "intensity") %>% 
  mutate(grp = floor(as.numeric(intensity)/20)   1) %>% 
  count(ID, Hour, grp, value) %>% 
  mutate(grp = paste0("grp", grp)) %>% 
  pivot_wider(names_from = c(grp, value), names_sep = "_", values_from = "n")
#> # A tibble: 3 x 11
#>      ID  Hour grp1_0 grp1_1 grp1_2 grp2_0 grp2_1 grp2_2 grp3_0 grp3_1 grp3_2
#>   <dbl> <dbl>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>
#> 1    21     1      7      8      5      7      8      5      9      5      6
#> 2    21     2      6      9      5      6      4     10      6      6      8
#> 3    22     0      6     10      4      6      5      9      5      8      7

^{Created on 2022-02-14 by the reprex package (v2.0.1)}

To return summary results embedded with original data:

  left_join(., d) %>%  
  select(ID:Hour, 
         intensity00:intensity19, 
         starts_with("grp1"),
         intensity20:intensity39,
         starts_with("grp2"),
         intensity40:intensity59,
         starts_with("grp3"))

CodePudding user response：

Split the data every n column using split.default, in my example it is every 5 columns. Then loop through 0, 1, 2 get counts per row using rowSums:

# example data
set.seed(1); d <- data.frame(matrix(sample(c(0, 1, 2), 20, replace = TRUE), nrow = 2))
d
#   X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# 1  0  0  0  2  1  2  0  1  1   2
# 2  2  1  2  1  2  0  0  1  1   0

do.call(cbind, 
        lapply(split.default(d, rep(1:2, each = 5)), function(i){
          sapply(0:2, function(x) rowSums(i == x))
        }))
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    3    1    1    1    2    2
# [2,]    0    2    3    3    2    0

CodePudding user response：

I have a suggestion for how to implement a solution with mutate and sum. It's not as elegant as the other answers but perhaps still useful.

I tested the code with the sample data frame created in the answer by @Dan Adams, and I got the same result.

library(tidyverse)

d2 <- d %>%
  rowwise() %>%
  #Frequencies for first group
  mutate(cond1_0 = sum(c_across(3:22) == 0)) %>%
  mutate(cond1_1 = sum(c_across(3:22) == 1)) %>%
  mutate(cond1_2 = sum(c_across(3:22) == 2)) %>%
  #Frequencies for second group
  mutate(cond2_0 = sum(c_across(23:42) == 0)) %>%
  mutate(cond2_1 = sum(c_across(23:42) == 1)) %>%
  mutate(cond2_2 = sum(c_across(23:42) == 2)) %>%
  #Frequencies for third group
  mutate(cond3_0 = sum(c_across(43:62) == 0)) %>%
  mutate(cond3_1 = sum(c_across(43:62) == 1)) %>%
  mutate(cond3_2 = sum(c_across(43:62) == 2))