Home > Back-end >  How to calculate difference between first and last score
How to calculate difference between first and last score

Time:01-31

I want to calculate the difference between the first and last score for each person according to the dates. Original data looks like this:

ID <- c(1,1,1,2,2,3,3,3,3,4)
Score <- c(3,2,1,1,2,0,0,3,4,0)
Date <- c("2020/01/01","2020/01/02","2020/01/03","2020/02/05","2020/02/06","2021/10/01","2021/10/02","2021/10/03","2021/10/04","2022/03/01")

a <- data.frame(ID,Score,Date)

desired dataset:

diff_first_last <- c(-2,-2,-2,1,1,4,4,4,4,0)
b <- data.frame(ID,Score,Date,diff_first_last)

Is there a way to do this easily because I want to apply it to a much bigger dataset. I would appreciate all the help there is! Thanks!!!

CodePudding user response:

In base you can use ave assuming it is sorted by dates.

ave(a$Score, a$ID, FUN = \(x) x[length(x)] - x[1])
# [1] -2 -2 -2  1  1  4  4  4  4  0

In case it is not sorted.

a <- a[order(as.Date(a$Date)),]

Benchmark

a <- data.frame(ID = rep(1:9999, each=10), Score <- c(3,2,1,1,2,0,0,3,4,0),
      Date = c("2020/01/01","2020/01/02","2020/01/03","2020/02/05","2020/02/06",
           "2021/10/01","2021/10/02","2021/10/03","2021/10/04","2022/03/01"))

library(dplyr)
bench::mark(check=FALSE,
DPH = {dplyr::group_by(a, ID) %>%
    dplyr::mutate(Date = lubridate::ymd(Date)) %>% 
    dplyr::arrange(Date) %>%
    mutate(diff_first_last = dplyr::last(Score) - dplyr::first(Score)) %>%
    dplyr::ungroup()},
GKi = {. <- a[order(as.Date(a$Date)),]
  cbind(., diff_first_last = ave(.$Score, .$ID, FUN = \(x) x[length(x)] - x[1]))
  }
)
#  expression      min  median itr/s…¹ mem_a…² gc/se…³ n_itr  n_gc total…⁴ result
#  <bch:expr> <bch:tm> <bch:t>   <dbl> <bch:b>   <dbl> <int> <dbl> <bch:t> <list>
#1 DPH           10.3s   10.3s  0.0969   137MB    6.88     1    71   10.3s <NULL>
#2 GKi         156.1ms 159.2ms  5.79    26.5MB    1.93     3     1 518.2ms <NULL>

The base solution is in this case about 60 times faster and uses 1/5 of memory.

CodePudding user response:

this is one possible dplyr solution:

library(dplyr)
# build groupings by ID
dplyr::group_by(a, ID) %>%
    # convert date from text to date to be able to order by it
    dplyr::mutate(Date = lubridate::ymd(Date)) %>% 
    # order by date just to be sure
    dplyr::arrange(Date) %>%
    # calculate first last diference 
    mutate(diff_first_last = dplyr::last(Score) - dplyr::first(Score)) %>%
    # ungroup to prevent unwanted behaviour downstream
    dplyr::ungroup()

 # A tibble: 10 x 4
      ID Score Date       diff_first_last
   <dbl> <dbl> <date>               <dbl>
 1     1     3 2020-01-01              -2
 2     1     2 2020-01-02              -2
 3     1     1 2020-01-03              -2
 4     2     1 2020-02-05               1
 5     2     2 2020-02-06               1
 6     3     0 2021-10-01               4
 7     3     0 2021-10-02               4
 8     3     3 2021-10-03               4
 9     3     4 2021-10-04               4
10     4     0 2022-03-01               0
  •  Tags:  
  • r
  • Related