I have big data frame (dim: 12867779x5) which looks like that:
id | group | date1 | date 2 | icf |
---|---|---|---|---|
id1 | 2 | 2020-03-17 | 2019-06-05 | |
id1 | 3 | 2020-04-03 | 2019-05-09 | |
id2 | 2 | 2020-04-10 | 2019-07-04 | |
id2 | 3 | 2021-04-1 | 2020-06-01 | |
id3 | 1 | 2020-04-13 | 2019-07-07 | |
id3 | 2 | 2021-04-10 | 2020-06-01 | |
id3 | 3 | 2020-04-10 | 2019-07-04 | |
id3 | 4 | 2021-04-13 | 2020-06-01 |
Desired output:
id | group | date1 | date 2 | icf |
---|---|---|---|---|
id1 | 3 | 2020-04-03 | 2019-05-09 | 0 |
id2 | 2 | 2020-04-10 | 2019-07-04 | 52 |
id2 | 3 | 2021-04-01 | 2020-06-01 | 0 |
id3 | 1 | 2020-04-13 | 2019-07-07 | 49 |
id3 | 2 | 2021-04-10 | 2020-06-01 | -646 |
id3 | 3 | 2020-04-10 | 2019-07-04 | 52 |
id3 | 4 | 2021-04-13 | 2020-06-01 | 0 |
To calculate icf I need to check if the id's from row i and i 1 are the same. If yes icf = date2(i 1) - date1(i).
I wrote this function to calculate icf, but it's too slow. I'm looking for ways to speed it up, I was thinking about using the apply function but I don't have idea how to re-write this icfCalculation fucntion.
icfCalculation <- function(dataFrame){
nr <- nrow(dataFrame) - 1
for (i in 1:nr) {
if(dataFrame[i, 1] == dataFrame[i 1, 1]){
dataFrame[i,5] = dataFrame[i 1, 4] - dataFrame[i, 3]
}
else{
dataFrame[i,5] = 0
}
}
return(dataFrame)
}
CodePudding user response:
library(dplyr)
library(tidyr)
df %>%
mutate(icf = replace_na(ifelse(id == lead(id), lead(date2) - date1, 0), 0))
Rather than use tidyr::replace_na
you could also specify the default
argument of lead
.
Base R
A base R approach would be something like:
df$icf <- with(df, ifelse(id == c(id[2:nrow(df)], NA), c(date2[2:nrow(df)], NA) - date1, 0))
Output
id group date1 date2 icf
1 id1 2 2020-03-17 2019-06-05 -313
2 id1 3 2020-04-03 2019-05-09 0
3 id2 2 2020-04-10 2019-07-04 52
4 id2 3 2021-04-01 2020-06-01 0
5 id3 1 2020-04-13 2019-07-07 49
6 id3 2 2021-04-10 2020-06-01 -646
7 id3 3 2020-04-10 2019-07-04 52
8 id3 4 2021-04-13 2020-06-01 0
CodePudding user response:
Thanks for putting expected output. This is not the same as what you have put - but it does give the same results as your function and should be significantly quicker to thanks to the data.table
internal optimisations:
library(data.table)
# Read in data
dat <- read.table(text = "id group date1 date2
id1 2 2020-03-17 2019-06-05
id1 3 2020-04-03 2019-05-09
id2 2 2020-04-10 2019-07-04
id2 3 2021-04-1 2020-06-01
id3 1 2020-04-13 2019-07-07
id3 2 2021-04-10 2020-06-01
id3 3 2020-04-10 2019-07-04
id3 4 2021-04-13 2020-06-01",
h = T,
colClasses = c("character", "character", "Date", "Date")
)
# Make it a data.table
setDT(dat)
dat[, icf := fifelse(
id == shift(id, type = "lead"),
as.integer(
shift(date2, type = "lead") - date1
),
0)
]
dat
# id group date1 date2 icf
# 1: id1 2 2020-03-17 2019-06-05 -313
# 2: id1 3 2020-04-03 2019-05-09 0
# 3: id2 2 2020-04-10 2019-07-04 52
# 4: id2 3 2021-04-01 2020-06-01 0
# 5: id3 1 2020-04-13 2019-07-07 49
# 6: id3 2 2021-04-10 2020-06-01 -646
# 7: id3 3 2020-04-10 2019-07-04 52
# 8: id3 4 2021-04-13 2020-06-01 NA
If you want the last NA to be 0, just add dat$icf[nrow(dat)] <- 0
.