I have a dataset like something like the one below with 100's of lines. I want to average the student score by year and school. So I would have one score for ISD 1 for 2019 and ISD 1 for 2020, etc. I am using this code to do that.
df <- df %>%
group_by(Year, `School Name`) %>%
summarise(across(everything(), .f = list(mean = mean), na.rm = TRUE))
But I need the n count for average. I want to know how many students went into getting the average. How do I do that with the NAs?
Year | School Name | Student Score | Student Score |
---|---|---|---|
2019 | ISD 1 | 1 | NA |
2020 | ISD 4 | 4 | 2 |
2020 | ISD 3 | NA | 3 |
2018 | ISD 1 | 4 | NA |
2019 | ISD 4 | 2 | 5 |
2020 | ISD 4 | 3 | 2 |
2019 | ISD 3 | NA | 1 |
2018 | ISD1 | 2 | 4 |
CodePudding user response:
If I understood correctly, this might help you
#Libraries
library(tidyverse)
library(lubridate)
#Data
df <-
tibble::tribble(
~Year, ~School.Name, ~Student.Score1, ~Student.Score2,
2019L, "ISD 1", 1L, NA,
2020L, "ISD 4", 4L, 2L,
2020L, "ISD 3", NA, 3L,
2018L, "ISD 1", 4L, NA,
2019L, "ISD 4", 2L, 5L,
2020L, "ISD 4", 3L, 2L,
2019L, "ISD 3", NA, 1L,
2018L, "ISD 1", 2L, 4L
)
#How to
df %>%
group_by(Year,School.Name) %>%
summarise(
n = n(),
across(.cols = contains(".Score"),.fns = function(x)mean(x,na.rm = TRUE))
)
# A tibble: 6 x 5
# Groups: Year [3]
Year School.Name n Student.Score1 Student.Score2
<int> <chr> <int> <dbl> <dbl>
1 2018 ISD 1 2 3 4
2 2019 ISD 1 1 1 NaN
3 2019 ISD 3 1 NaN 1
4 2019 ISD 4 1 2 5
5 2020 ISD 3 1 NaN 3
6 2020 ISD 4 2 3.5 2
CodePudding user response:
I'm guessing the Student Score
columns represent separate students who should be looked at in combination with other students from the same school and year. If that's the case, then you probably should reshape your data into long format first, like below:
library(dplyr); library(tidyr)
df %>%
# reshape, keeping Year and School Name as keys
pivot_longer(-c(Year, `School.Name`)) %>%
group_by(Year, `School.Name`) %>%
filter(!is.na(value)) %>%
summarise(mean = mean(value),
n = n(), .groups = "drop")
Result
Year School.Name mean n
<int> <chr> <dbl> <int>
1 2018 ISD 1 4 1
2 2018 ISD1 3 2
3 2019 ISD 1 1 1
4 2019 ISD 3 1 1
5 2019 ISD 4 3.5 2
6 2020 ISD 3 3 1
7 2020 ISD 4 2.75 4
(Note, I have used the data as-is, but I suspect "ISD 1" and "ISD1" are supposed to be the same thing, in which case you may need to do some data cleaning first.)
Starting data: (Note, the non-uniquely named Student Score columns are renamed by the data.frame
function to have unique names, e.g. Student.Score and Student.Score.1)
df <- data.frame(
stringsAsFactors = FALSE,
Year = c(2019L, 2020L, 2020L, 2018L, 2019L, 2020L, 2019L, 2018L),
`School Name` = c("ISD 1","ISD 4","ISD 3",
"ISD 1","ISD 4","ISD 4","ISD 3","ISD1"),
`Student Score` = c(1L, 4L, NA, 4L, 2L, 3L, NA, 2L),
`Student Score` = c(NA, 2L, 3L, NA, 5L, 2L, 1L, 4L)
)
CodePudding user response:
Pivoting longer might be a good approach.
df %>%
pivot_longer(cols = c(-Year, -`School Name`)) %>%
group_by(Year, `School Name`) %>%
summarise(mean = mean(value, na.rm = T))
output
# A tibble: 6 x 3
# Groups: Year [3]
Year `School Name` mean
<int> <chr> <dbl>
1 2018 ISD 1 3.33
2 2019 ISD 1 1
3 2019 ISD 3 1
4 2019 ISD 4 3.5
5 2020 ISD 3 3
6 2020 ISD 4 2.75