I am working with a large dataset (over 1 million rows) that contains, among other columns, a date and a delay (a number of days), e.g.:
ID col1 Date Delay
1: A 100 2021-05-01 1
2: B 200 2018-04-03 3
3: C 300 2020-02-17 2
I want to duplicate the rows in the table depending on the delay amount, while incrementing the date for each row in a new column:
ID col1 Date Delay New_Date
1: A 100 2021-05-01 1 2021-05-02
2: B 200 2018-04-03 3 2018-04-04
3: B 200 2018-04-03 3 2018-04-05
4: B 200 2018-04-03 3 2018-04-06
5: C 300 2020-02-17 2 2020-02-18
6: C 300 2020-02-17 2 2020-02-19
I am currently doing it with a for each loop, which is extremely inefficient and takes a lot of time.
for(row in 1:nrow(df)) {
  delay <- as.numeric(df[row, "Delay"])
  tempdf <- df[0,]
  for(numberDelay in 1:delay) {
    tempdf[numberDelay, ] <- df[row, ]
    tempdf[numberDelay, "New_Date"] <- as.Date.character(tempdf[numberDelay, "Date"] + as.numeric(numberDelay),
                                                         tryFormats = "%Y-%m-%d")
  }
  result <- rbind(result, tempdf)
}
Context: This would allow me to identify delay days that fall on a weekend or a national holiday by comparing each new date against a list of blacklisted dates. Is there an efficient way to do this in R?
CodePudding user response:
You can try with dplyr and tidyr:
library(dplyr)
library(tidyr)
df %>%
  rowwise() %>%
  mutate(New_Date = list(seq.Date(Date + 1, Date + Delay, by = "day"))) %>%
  unnest(New_Date)
#> # A tibble: 6 x 5
#> ID col1 Date Delay New_Date
#> <chr> <int> <date> <int> <date>
#> 1 A 100 2021-05-01 1 2021-05-02
#> 2 B 200 2018-04-03 3 2018-04-04
#> 3 B 200 2018-04-03 3 2018-04-05
#> 4 B 200 2018-04-03 3 2018-04-06
#> 5 C 300 2020-02-17 2 2020-02-18
#> 6 C 300 2020-02-17 2 2020-02-19
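Since rowwise() calls seq.Date() once per row, it may still be slow on a million rows. As a rough sketch of a fully vectorized alternative in base R (assuming Date is already of class Date and Delay holds non-negative whole numbers; the example data frame is rebuilt here only so the snippet runs on its own):

```r
# Example data matching the question
df <- data.frame(
  ID    = c("A", "B", "C"),
  col1  = c(100L, 200L, 300L),
  Date  = as.Date(c("2021-05-01", "2018-04-03", "2020-02-17")),
  Delay = c(1L, 3L, 2L)
)

# Repeat each row index Delay times: 1, 2, 2, 2, 3, 3
idx <- rep(seq_len(nrow(df)), df$Delay)
out <- df[idx, ]

# sequence() yields 1..Delay for every row: 1, 1, 2, 3, 1, 2
out$New_Date <- out$Date + sequence(df$Delay)

# Drop the "2.1", "2.2", ... row names created by the repeated indexing
rownames(out) <- NULL
out
```

Both rep() and sequence() work on whole vectors, so there is no per-row function call. Rows with a Delay of 0 simply drop out of the result.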
However, considering the context you explained, I think something like this could be more effective for you:
# example of vector of blacklisted days
blacklist_days <- as.Date(c("2020-02-18", "2018-04-04", "2018-04-05"))
df %>%
  rowwise() %>%
  mutate(Dates = list(seq.Date(Date + 1, Date + Delay, by = "day"))) %>%
  mutate(n_bl = sum(Dates %in% blacklist_days)) %>%
  ungroup()
#> # A tibble: 3 x 6
#> ID col1 Date Delay Dates n_bl
#> <chr> <int> <date> <int> <list> <int>
#> 1 A 100 2021-05-01 1 <date [1]> 0
#> 2 B 200 2018-04-03 3 <date [3]> 2
#> 3 C 300 2020-02-17 2 <date [2]> 1
This way you avoid duplicating rows, which could hurt performance on a dataset of this size.
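The same per-row counts can also be computed without rowwise() or a list column. The following is only a sketch in base R, under the assumptions that Date is of class Date and Delay holds non-negative whole numbers; n_weekend is a hypothetical extra column added here to cover the weekend case from the question, and the weekend test relies on the "%u" format code (weekday as a number, Monday = 1):

```r
# Example data and blacklist matching the answer above
df <- data.frame(
  ID    = c("A", "B", "C"),
  col1  = c(100L, 200L, 300L),
  Date  = as.Date(c("2021-05-01", "2018-04-03", "2020-02-17")),
  Delay = c(1L, 3L, 2L)
)
blacklist_days <- as.Date(c("2020-02-18", "2018-04-04", "2018-04-05"))

# Build the expanded dates as plain vectors, never as a data frame
row_id   <- rep(seq_len(nrow(df)), df$Delay)
new_date <- df$Date[row_id] + sequence(df$Delay)

# Flag each expanded date
is_blacklisted <- new_date %in% blacklist_days
is_weekend     <- format(new_date, "%u") %in% c("6", "7")  # Saturday, Sunday

# Count the flagged dates per original row
df$n_bl      <- tabulate(row_id[is_blacklisted], nbins = nrow(df))
df$n_weekend <- tabulate(row_id[is_weekend], nbins = nrow(df))
df
```

Only two temporary vectors are as long as the expanded data; the data frame itself keeps one row per original record.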
CodePudding user response:
You can create a data frame of duplicates separately, and then combine them with the original. This uses a loop to go through the different values of Delay.
> dat <- data.frame(ID = LETTERS[1:3], col1 = 1:3 * 100,
date = as.Date(c('2021-05-01', '2018-04-03', '2020-02-17')),
delay = c(1, 3, 2))
> dat
ID col1 date delay
1 A 100 2021-05-01 1
2 B 200 2018-04-03 3
3 C 300 2020-02-17 2
> dat$sk <- 1:nrow(dat)
> ddup <- data.frame()
> for (i in 2:3) {
    dd <- dat[dat$delay >= i, ]
    dd$date <- dd$date + 1
    ddup <- rbind(ddup, dd)
  }
> dat <- rbind(dat, ddup)
> dat <- dat[order(dat$sk, dat$date), ]
> dat
ID col1 date delay sk
1 A 100 2021-05-01 1 1
2 B 200 2018-04-03 3 2
22 B 200 2018-04-04 3 2
21 B 200 2018-04-04 3 2
3 C 300 2020-02-17 2 3
31 C 300 2020-02-18 2 3
>
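In the output above both extra copies of row B carry the same date, because every duplicate is shifted by exactly one day, and the range 2:3 is hard-coded. A possible variant of the same idea (one pass per delay step rather than per row) that shifts the i-th copy by i days and writes it to a separate column might look like this; new_date is a name chosen here for illustration, and the sketch assumes at least one row with a delay of 1 or more:

```r
dat <- data.frame(ID = LETTERS[1:3], col1 = 1:3 * 100,
                  date = as.Date(c("2021-05-01", "2018-04-03", "2020-02-17")),
                  delay = c(1, 3, 2))
dat$sk <- seq_len(nrow(dat))  # sort key to restore the original row order

# One iteration per delay step (1, 2, ..., max delay) instead of one per row
pieces <- lapply(seq_len(max(dat$delay)), function(i) {
  dd <- dat[dat$delay >= i, ]   # rows that still need an i-th copy
  dd$new_date <- dd$date + i    # the i-th copy is shifted by i days
  dd
})

out <- do.call(rbind, pieces)
out <- out[order(out$sk, out$new_date), ]
rownames(out) <- NULL
out
```

The number of iterations equals the largest delay, not the number of rows, so this should scale much better than a row-by-row loop when delays are small.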