I have a data frame in the following form (grouped by Year, Subject and Level). There are two „Levels“ and around 25 Subjects (the number varies per year) over 30 years.
Code to reproduce:
structure(list(Year = c(2005L, 2005L, 2005L, 2005L, 2005L, 2005L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2008L, 2008L), Subject = c("Englisch",
"Englisch", "GK", "GK", "Mathematik", "Mathematik", "Englisch", "Englisch", "GK", "GK", "Mathematik", "Mathematik", "Englisch", "Englisch", "GK", "GK", "Mathematik", "Mathematik", "Englisch",
"Englisch"), Level = c(20L, 21L, 20L, 21L, 20L, 21L, 20L, 21L, 20L, 21L, 20L, 21L, 20L, 21L, 20L, 21L, 20L, 21L, 20L, 21L), Students = c(27104L, 24841L, 29945L, 2862L, 29907L, 27802L, 28837L, 26357L, 31815L, 2880L, 31721L, 29115L, 30324L, 28064L, 32924L, 3508L, 32862L, 30862L, 32124L, 29624L)), row.names = c(NA, -20L), groups = structure(list(Year = c(2005L, 2005L, 2005L, 2006L, 2006L, 2006L, 2007L, 2007L, 2007L, 2008L), Subject = c("Englisch", "GK", "Mathematik", "Englisch", "GK", "Mathematik", "Englisch", "GK", "Mathematik", "Englisch"), .rows = structure(list(1:2, 3:4, 5:6, 7:8, 9:10, 11:12, 13:14, 15:16, 17:18, 19:20), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", "tbl_df", "tbl", "data.frame")) -> rep
Year | Subject | Students | Level |
---|---|---|---|
2000 | English | 100 | 1 |
2000 | Maths | 50 | 1 |
2000 | ... | ... | ... |
2001 | Maths | 30 | 2 |
2001 | English | 70 | 2 |
After each year, students usually ascend one level. To find out how many students did not ascend a level I need to calculate Level 1 Students of English in 2000 - Level 2 Students of English in 2001
The new data frame should thus look like this:
Year | Subject | Students | Level | Difference |
---|---|---|---|---|
2000 | English | 100 | 1 | NA |
2000 | Maths | 50 | 1 | NA |
2000 | ... | ... | ... | ... |
2001 | Maths | 30 | 2 | 20 |
2001 | English | 70 | 2 | 30 |
I have tried dplyr:lag() but I don't know how to make it work with several groups.
CodePudding user response:
You should be able to get that with lag
after arrange
and group_by
-
library(dplyr)
df %>%
arrange(Subject, Year, Level) %>%
group_by(Subject) %>%
mutate(Difference = lag(Students) - Students)
#You can also use diff
#mutate(Difference = c(NA, -diff(Students)))
CodePudding user response:
We may do this with data.table
library(data.table)
setDT(df)[order(Subject, Year, Level)][,
Difference := shift(Students) - Students), by = Subject]