Home > Software engineering >  How to combine across, summarize, and n() in R to get number of non-NA values by column?
How to combine across, summarize, and n() in R to get number of non-NA values by column?

Time:10-27

I have a list of questions, and I want to know how many rows have non-NA values using summarize. I want to use summarize because I'm already using that to calculate the average, which works in the below code. Why does the below code not work and how can I fix it?

library(dplyr)

test <- tibble(student = c("j", "c", "s"),
               q1 = c(1, 2, 3),
               q2 = c(NA_real_, NA_real_, 4),
               q3 = c(43, NA_real_, 232))

test %>%
  dplyr::summarise(n = across(starts_with("q"), ~n(.x)),
                   avg = across(contains("q"), ~ round(mean(.x, na.rm = T), 2)))


expected_outcome <- tibble(n_q1 = 3,
                           n_q2 = 1,
                           n_q3 = 2,
                           avg_q1 = 2,
                           avg_q2 = 4,
                           avg_q3 = 138)

CodePudding user response:

library(dplyr)

test %>% 
  summarize(across(starts_with("q"), list(n = ~sum(!is.na(.)),
                                         avg = ~mean(., na.rm = T)),
                   .names = "{.fn}_{.col}"))

From the ?across documentation, you can pass a list to the .fns argument:

A list of functions/lambdas, e.g. list(mean = mean, n_miss = ~ sum(is.na(.x))

This will apply every function in that list to the columns you have specified. You can then use the .names argument of across to set the column names how you desire.

Output

   n_q1 avg_q1  n_q2 avg_q2  n_q3 avg_q3
  <int>  <dbl> <int>  <dbl> <int>  <dbl>
1     3      2     1      4     2   138.

CodePudding user response:

Update: Upps I missed the whole question. sorry: But here is an alternative just for fun: The preferred answer is already given by @LMc:

library(dplyr)

test %>% 
  summarise(across(starts_with("q"), list(avg = ~mean(., na.rm = T)),
                   .names = "{.fn}_{.col}")) %>% 
  bind_cols(test %>% purrr::map_df(~sum(!is.na(.))))

  avg_q1 avg_q2 avg_q3 student    q1    q2    q3
   <dbl>  <dbl>  <dbl>   <int> <int> <int> <int>
1      2      4   138.       3     3     1     2
test %>% 
  summarise(across(starts_with("q"), list(avg = ~mean(., na.rm = T)),
                   .names = "{.fn}_{.col}")) %>% 
  bind_cols(test %>% purrr::map_df(~sum(!is.na(.))))

First not full answer: To get the non-nas of the whole dataset, we could do this:

library(dplyr)

test %>% 
  purrr::map_df(~sum(!is.na(.)))
 student    q1    q2    q3
    <int> <int> <int> <int>
1       3     3     1     2
  • Related