Dates subtraction from different rows of data frame-CodePudding

I have big data frame (dim: 12867779x5) which looks like that:

id	group	date1	date 2
id1	2	2020-03-17	2019-06-05
id1	3	2020-04-03	2019-05-09
id2	2	2020-04-10	2019-07-04
id2	3	2021-04-1	2020-06-01
id3	1	2020-04-13	2019-07-07
id3	2	2021-04-10	2020-06-01
id3	3	2020-04-10	2019-07-04
id3	4	2021-04-13	2020-06-01

Desired output:

id	group	date1	date 2	icf
id1	3	2020-04-03	2019-05-09	0
id2	2	2020-04-10	2019-07-04	52
id2	3	2021-04-01	2020-06-01	0
id3	1	2020-04-13	2019-07-07	49
id3	2	2021-04-10	2020-06-01	-646
id3	3	2020-04-10	2019-07-04	52
id3	4	2021-04-13	2020-06-01	0

To calculate icf I need to check if the id's from row i and i 1 are the same. If yes icf = date2(i 1) - date1(i).

I wrote this function to calculate icf, but it's too slow. I'm looking for ways to speed it up, I was thinking about using the apply function but I don't have idea how to re-write this icfCalculation fucntion.

icfCalculation <- function(dataFrame){
  nr <- nrow(dataFrame) - 1
  for (i in 1:nr) {
    if(dataFrame[i, 1] == dataFrame[i 1, 1]){
      dataFrame[i,5] = dataFrame[i 1, 4] - dataFrame[i, 3]
    } 
    else{
      dataFrame[i,5] = 0
    }
  }
  return(dataFrame)
}

CodePudding user response：

library(dplyr)
library(tidyr)

df %>% 
  mutate(icf = replace_na(ifelse(id == lead(id), lead(date2) - date1, 0), 0))

Rather than use tidyr::replace_na you could also specify the default argument of lead.

Base R

A base R approach would be something like:

df$icf <- with(df, ifelse(id == c(id[2:nrow(df)], NA), c(date2[2:nrow(df)], NA) - date1, 0))

Output

   id group      date1      date2  icf
1 id1     2 2020-03-17 2019-06-05 -313
2 id1     3 2020-04-03 2019-05-09    0
3 id2     2 2020-04-10 2019-07-04   52
4 id2     3 2021-04-01 2020-06-01    0
5 id3     1 2020-04-13 2019-07-07   49
6 id3     2 2021-04-10 2020-06-01 -646
7 id3     3 2020-04-10 2019-07-04   52
8 id3     4 2021-04-13 2020-06-01    0

CodePudding user response：

Thanks for putting expected output. This is not the same as what you have put - but it does give the same results as your function and should be significantly quicker to thanks to the data.table internal optimisations:

library(data.table)

# Read in data 
dat  <- read.table(text = "id   group   date1   date2   
id1 2   2020-03-17  2019-06-05  
id1 3   2020-04-03  2019-05-09  
id2 2   2020-04-10  2019-07-04  
id2 3   2021-04-1   2020-06-01  
id3 1   2020-04-13  2019-07-07  
id3 2   2021-04-10  2020-06-01  
id3 3   2020-04-10  2019-07-04  
id3 4   2021-04-13  2020-06-01", 
    h = T, 
    colClasses = c("character", "character", "Date", "Date")
)

# Make it a data.table
setDT(dat)

dat[, icf := fifelse(
    id == shift(id, type = "lead"), 
    as.integer(
        shift(date2, type = "lead") - date1
        ), 
    0)
]
dat
#     id group      date1      date2  icf
# 1: id1     2 2020-03-17 2019-06-05 -313
# 2: id1     3 2020-04-03 2019-05-09    0
# 3: id2     2 2020-04-10 2019-07-04   52
# 4: id2     3 2021-04-01 2020-06-01    0
# 5: id3     1 2020-04-13 2019-07-07   49
# 6: id3     2 2021-04-10 2020-06-01 -646
# 7: id3     3 2020-04-10 2019-07-04   52
# 8: id3     4 2021-04-13 2020-06-01   NA

If you want the last NA to be 0, just add dat$icf[nrow(dat)] <- 0.