Home > Blockchain >  How to use group_by with mean and sum in dplyr?
How to use group_by with mean and sum in dplyr?

Time:09-22

I have a dataset like something like the one below with 100's of lines. I want to average the student score by year and school. So I would have one score for ISD 1 for 2019 and ISD 1 for 2020, etc. I am using this code to do that.

df <- df %>%
  group_by(Year, `School Name`) %>% 
  summarise(across(everything(), .f = list(mean = mean), na.rm = TRUE))

But I need the n count for average. I want to know how many students went into getting the average. How do I do that with the NAs?

Year School Name Student Score Student Score
2019 ISD 1 1 NA
2020 ISD 4 4 2
2020 ISD 3 NA 3
2018 ISD 1 4 NA
2019 ISD 4 2 5
2020 ISD 4 3 2
2019 ISD 3 NA 1
2018 ISD1 2 4

CodePudding user response:

If I understood correctly, this might help you

#Libraries

library(tidyverse)
library(lubridate)

#Data

df <-
 tibble::tribble(
    ~Year, ~School.Name, ~Student.Score1, ~Student.Score2,
    2019L,      "ISD 1",             1L,             NA,
    2020L,      "ISD 4",             4L,             2L,
    2020L,      "ISD 3",             NA,             3L,
    2018L,      "ISD 1",             4L,             NA,
    2019L,      "ISD 4",             2L,             5L,
    2020L,      "ISD 4",             3L,             2L,
    2019L,      "ISD 3",             NA,             1L,
    2018L,      "ISD 1",             2L,             4L
    )

#How to

df %>% 
  group_by(Year,School.Name) %>% 
  summarise(
    n = n(),
    across(.cols = contains(".Score"),.fns = function(x)mean(x,na.rm = TRUE))
  )

# A tibble: 6 x 5
# Groups:   Year [3]
   Year School.Name     n Student.Score1 Student.Score2
  <int> <chr>       <int>          <dbl>          <dbl>
1  2018 ISD 1           2            3                4
2  2019 ISD 1           1            1              NaN
3  2019 ISD 3           1          NaN                1
4  2019 ISD 4           1            2                5
5  2020 ISD 3           1          NaN                3
6  2020 ISD 4           2            3.5              2

CodePudding user response:

I'm guessing the Student Score columns represent separate students who should be looked at in combination with other students from the same school and year. If that's the case, then you probably should reshape your data into long format first, like below:

library(dplyr); library(tidyr)
df %>% 
  # reshape, keeping Year and School Name as keys
  pivot_longer(-c(Year, `School.Name`)) %>%
  group_by(Year, `School.Name`) %>% 
  filter(!is.na(value)) %>%
  summarise(mean = mean(value),
            n = n(), .groups = "drop")

Result

   Year School.Name  mean     n
  <int> <chr>       <dbl> <int>
1  2018 ISD 1        4        1
2  2018 ISD1         3        2
3  2019 ISD 1        1        1
4  2019 ISD 3        1        1
5  2019 ISD 4        3.5      2
6  2020 ISD 3        3        1
7  2020 ISD 4        2.75     4

(Note, I have used the data as-is, but I suspect "ISD 1" and "ISD1" are supposed to be the same thing, in which case you may need to do some data cleaning first.)

Starting data: (Note, the non-uniquely named Student Score columns are renamed by the data.frame function to have unique names, e.g. Student.Score and Student.Score.1)

df <- data.frame(
  stringsAsFactors = FALSE,
              Year = c(2019L, 2020L, 2020L, 2018L, 2019L, 2020L, 2019L, 2018L),
       `School Name` = c("ISD 1","ISD 4","ISD 3",
                       "ISD 1","ISD 4","ISD 4","ISD 3","ISD1"),
     `Student Score` = c(1L, 4L, NA, 4L, 2L, 3L, NA, 2L),
     `Student Score` = c(NA, 2L, 3L, NA, 5L, 2L, 1L, 4L)
) 

CodePudding user response:

Pivoting longer might be a good approach.

df %>% 
  pivot_longer(cols = c(-Year, -`School Name`)) %>% 
  group_by(Year, `School Name`) %>%
  summarise(mean = mean(value, na.rm = T))

output


# A tibble: 6 x 3
# Groups:   Year [3]
   Year `School Name`  mean
  <int> <chr>         <dbl>
1  2018 ISD 1          3.33
2  2019 ISD 1          1   
3  2019 ISD 3          1   
4  2019 ISD 4          3.5 
5  2020 ISD 3          3   
6  2020 ISD 4          2.75

  • Related