I want to calculate the difference between the first and last score for each person according to the dates. Original data looks like this:
ID <- c(1,1,1,2,2,3,3,3,3,4)
Score <- c(3,2,1,1,2,0,0,3,4,0)
Date <- c("2020/01/01","2020/01/02","2020/01/03","2020/02/05","2020/02/06","2021/10/01","2021/10/02","2021/10/03","2021/10/04","2022/03/01")
a <- data.frame(ID,Score,Date)
desired dataset:
diff_first_last <- c(-2,-2,-2,1,1,4,4,4,4,0)
b <- data.frame(ID,Score,Date,diff_first_last)
Is there a way to do this easily because I want to apply it to a much bigger dataset. I would appreciate all the help there is! Thanks!!!
CodePudding user response:
In base you can use ave
assuming it is sorted by dates.
ave(a$Score, a$ID, FUN = \(x) x[length(x)] - x[1])
# [1] -2 -2 -2 1 1 4 4 4 4 0
In case it is not sorted.
a <- a[order(as.Date(a$Date)),]
Benchmark
a <- data.frame(ID = rep(1:9999, each=10), Score <- c(3,2,1,1,2,0,0,3,4,0),
Date = c("2020/01/01","2020/01/02","2020/01/03","2020/02/05","2020/02/06",
"2021/10/01","2021/10/02","2021/10/03","2021/10/04","2022/03/01"))
library(dplyr)
bench::mark(check=FALSE,
DPH = {dplyr::group_by(a, ID) %>%
dplyr::mutate(Date = lubridate::ymd(Date)) %>%
dplyr::arrange(Date) %>%
mutate(diff_first_last = dplyr::last(Score) - dplyr::first(Score)) %>%
dplyr::ungroup()},
GKi = {. <- a[order(as.Date(a$Date)),]
cbind(., diff_first_last = ave(.$Score, .$ID, FUN = \(x) x[length(x)] - x[1]))
}
)
# expression min median itr/s…¹ mem_a…² gc/se…³ n_itr n_gc total…⁴ result
# <bch:expr> <bch:tm> <bch:t> <dbl> <bch:b> <dbl> <int> <dbl> <bch:t> <list>
#1 DPH 10.3s 10.3s 0.0969 137MB 6.88 1 71 10.3s <NULL>
#2 GKi 156.1ms 159.2ms 5.79 26.5MB 1.93 3 1 518.2ms <NULL>
The base solution is in this case about 60 times faster and uses 1/5 of memory.
CodePudding user response:
this is one possible dplyr solution:
library(dplyr)
# build groupings by ID
dplyr::group_by(a, ID) %>%
# convert date from text to date to be able to order by it
dplyr::mutate(Date = lubridate::ymd(Date)) %>%
# order by date just to be sure
dplyr::arrange(Date) %>%
# calculate first last diference
mutate(diff_first_last = dplyr::last(Score) - dplyr::first(Score)) %>%
# ungroup to prevent unwanted behaviour downstream
dplyr::ungroup()
# A tibble: 10 x 4
ID Score Date diff_first_last
<dbl> <dbl> <date> <dbl>
1 1 3 2020-01-01 -2
2 1 2 2020-01-02 -2
3 1 1 2020-01-03 -2
4 2 1 2020-02-05 1
5 2 2 2020-02-06 1
6 3 0 2021-10-01 4
7 3 0 2021-10-02 4
8 3 3 2021-10-03 4
9 3 4 2021-10-04 4
10 4 0 2022-03-01 0