How to use group_by with mean and sum in dplyr?-CodePudding

I have a dataset like something like the one below with 100's of lines. I want to average the student score by year and school. So I would have one score for ISD 1 for 2019 and ISD 1 for 2020, etc. I am using this code to do that.

df <- df %>%
  group_by(Year, `School Name`) %>% 
  summarise(across(everything(), .f = list(mean = mean), na.rm = TRUE))

But I need the n count for average. I want to know how many students went into getting the average. How do I do that with the NAs?

Year	School Name	Student Score	Student Score
2019	ISD 1	1	NA
2020	ISD 4	4	2
2020	ISD 3	NA	3
2018	ISD 1	4	NA
2019	ISD 4	2	5
2020	ISD 4	3	2
2019	ISD 3	NA	1
2018	ISD1	2	4

CodePudding user response：

If I understood correctly, this might help you

#Libraries

library(tidyverse)
library(lubridate)

#Data

df <-
 tibble::tribble(
    ~Year, ~School.Name, ~Student.Score1, ~Student.Score2,
    2019L,      "ISD 1",             1L,             NA,
    2020L,      "ISD 4",             4L,             2L,
    2020L,      "ISD 3",             NA,             3L,
    2018L,      "ISD 1",             4L,             NA,
    2019L,      "ISD 4",             2L,             5L,
    2020L,      "ISD 4",             3L,             2L,
    2019L,      "ISD 3",             NA,             1L,
    2018L,      "ISD 1",             2L,             4L
    )

#How to

df %>% 
  group_by(Year,School.Name) %>% 
  summarise(
    n = n(),
    across(.cols = contains(".Score"),.fns = function(x)mean(x,na.rm = TRUE))
  )

# A tibble: 6 x 5
# Groups:   Year [3]
   Year School.Name     n Student.Score1 Student.Score2
  <int> <chr>       <int>          <dbl>          <dbl>
1  2018 ISD 1           2            3                4
2  2019 ISD 1           1            1              NaN
3  2019 ISD 3           1          NaN                1
4  2019 ISD 4           1            2                5
5  2020 ISD 3           1          NaN                3
6  2020 ISD 4           2            3.5              2

CodePudding user response：

I'm guessing the Student Score columns represent separate students who should be looked at in combination with other students from the same school and year. If that's the case, then you probably should reshape your data into long format first, like below:

library(dplyr); library(tidyr)
df %>% 
  # reshape, keeping Year and School Name as keys
  pivot_longer(-c(Year, `School.Name`)) %>%
  group_by(Year, `School.Name`) %>% 
  filter(!is.na(value)) %>%
  summarise(mean = mean(value),
            n = n(), .groups = "drop")

Result

   Year School.Name  mean     n
  <int> <chr>       <dbl> <int>
1  2018 ISD 1        4        1
2  2018 ISD1         3        2
3  2019 ISD 1        1        1
4  2019 ISD 3        1        1
5  2019 ISD 4        3.5      2
6  2020 ISD 3        3        1
7  2020 ISD 4        2.75     4

(Note, I have used the data as-is, but I suspect "ISD 1" and "ISD1" are supposed to be the same thing, in which case you may need to do some data cleaning first.)

Starting data: (Note, the non-uniquely named Student Score columns are renamed by the data.frame function to have unique names, e.g. Student.Score and Student.Score.1)

df <- data.frame(
  stringsAsFactors = FALSE,
              Year = c(2019L, 2020L, 2020L, 2018L, 2019L, 2020L, 2019L, 2018L),
       `School Name` = c("ISD 1","ISD 4","ISD 3",
                       "ISD 1","ISD 4","ISD 4","ISD 3","ISD1"),
     `Student Score` = c(1L, 4L, NA, 4L, 2L, 3L, NA, 2L),
     `Student Score` = c(NA, 2L, 3L, NA, 5L, 2L, 1L, 4L)
)

CodePudding user response：

Pivoting longer might be a good approach.

df %>% 
  pivot_longer(cols = c(-Year, -`School Name`)) %>% 
  group_by(Year, `School Name`) %>%
  summarise(mean = mean(value, na.rm = T))

output


# A tibble: 6 x 3
# Groups:   Year [3]
   Year `School Name`  mean
  <int> <chr>         <dbl>
1  2018 ISD 1          3.33
2  2019 ISD 1          1   
3  2019 ISD 3          1   
4  2019 ISD 4          3.5 
5  2020 ISD 3          3   
6  2020 ISD 4          2.75