Duplicate rows with conditions in dataframe without looping

Time:11-11

I am working with a large dataset (over 1 million rows) that contains, among other columns, a date column and a numeric delay column.

   ID    col1    Date         Delay
1:  A     100    2021-05-01    1
2:  B     200    2018-04-03    3
3:  C     300    2020-02-17    2

I want to duplicate the rows in the table depending on the delay amount, while incrementing the date for each row in a new column:

   ID    col1    Date         Delay   New_Date
1:  A     100    2021-05-01    1      2021-05-02
2:  B     200    2018-04-03    3      2018-04-04
3:  B     200    2018-04-03    3      2018-04-05
4:  B     200    2018-04-03    3      2018-04-06
5:  C     300    2020-02-17    2      2020-02-18
6:  C     300    2020-02-17    2      2020-02-19

I am currently doing this with nested for loops, which is extremely inefficient and takes a long time.

for(row in 1:nrow(df)) {
  delay <- as.numeric(df[row, "Delay"])
  tempdf <- df[0,]
    
  for(numberDelay in 1:delay) {
    tempdf[numberDelay, ] <- df[row, ]
    tempdf[numberDelay, "New_Date"] <- as.Date.character(tempdf[numberDelay, "Date"] + as.numeric(numberDelay),
    tryFormats = "%Y-%m-%d")
  }
  result <- rbind(result, tempdf)
}

Context: This would allow me to identify delay days that fell on a weekend or a national holiday by comparing each new date with a list of blacklisted dates. Is there an efficient way to do this in R?


CodePudding user response:

You can try with dplyr and tidyr:

library(dplyr)
library(tidyr)

df %>% 
  rowwise() %>% 
  mutate(New_Date = list(seq.Date(Date + 1, Date + Delay, by = "day"))) %>% 
  unnest(New_Date)
#> # A tibble: 6 x 5
#>   ID     col1 Date       Delay New_Date  
#>   <chr> <int> <date>     <int> <date>    
#> 1 A       100 2021-05-01     1 2021-05-02
#> 2 B       200 2018-04-03     3 2018-04-04
#> 3 B       200 2018-04-03     3 2018-04-05
#> 4 B       200 2018-04-03     3 2018-04-06
#> 5 C       300 2020-02-17     2 2020-02-18
#> 6 C       300 2020-02-17     2 2020-02-19
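
Note that rowwise() still calls seq.Date once per row, which may remain slow at a million rows. As a minimal sketch of a fully vectorised alternative in base R (assuming Date is of class Date and Delay holds non-negative whole numbers; the data frame below only rebuilds the example from the question), the rows can be repeated with rep() and the day offsets generated with sequence():

```r
# Rebuild the example data from the question
df <- data.frame(
  ID    = c("A", "B", "C"),
  col1  = c(100L, 200L, 300L),
  Date  = as.Date(c("2021-05-01", "2018-04-03", "2020-02-17")),
  Delay = c(1L, 3L, 2L)
)

# Repeat each row index Delay times: 1, 2, 2, 2, 3, 3
idx <- rep(seq_len(nrow(df)), times = df$Delay)

# Day offsets restart at 1 for every original row: 1, 1, 2, 3, 1, 2
offsets <- sequence(df$Delay)

# Duplicate the rows and add the incremented date in a new column
result <- df[idx, ]
result$New_Date <- result$Date + offsets
rownames(result) <- NULL
result
```

Rows with a Delay of 0 are simply dropped by rep(), which may or may not be what you want.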

However, considering the context you explained, I think something like this could be more effective for you:

# example of vector of blacklisted days
blacklist_days <- as.Date(c("2020-02-18", "2018-04-04", "2018-04-05"))
df %>% 
  rowwise() %>% 
  mutate(Dates = list(seq.Date(Date + 1, Date + Delay, by = "day"))) %>% 
  mutate(n_bl = sum(Dates %in% blacklist_days)) %>% 
  ungroup()
#> # A tibble: 3 x 6
#>   ID     col1 Date       Delay Dates       n_bl
#>   <chr> <int> <date>     <int> <list>     <int>
#> 1 A       100 2021-05-01     1 <date [1]>     0
#> 2 B       200 2018-04-03     3 <date [3]>     2
#> 3 C       300 2020-02-17     2 <date [2]>     1

This way you avoid duplicating rows, which could hurt performance on a table of this size.
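
The same per-row counts could also be sketched without a list column, and extended to weekends. This is only an illustration in base R: n_weekend is a column name introduced here, and format(x, "%u") is assumed to return the ISO weekday number (1 = Monday, 7 = Sunday), as documented for strptime.

```r
# Rebuild the example data and the example blacklist from above
df <- data.frame(
  ID    = c("A", "B", "C"),
  col1  = c(100L, 200L, 300L),
  Date  = as.Date(c("2021-05-01", "2018-04-03", "2020-02-17")),
  Delay = c(1L, 3L, 2L)
)
blacklist_days <- as.Date(c("2020-02-18", "2018-04-04", "2018-04-05"))

# One entry per delay day; idx records which original row it belongs to
idx      <- rep(seq_len(nrow(df)), times = df$Delay)
new_date <- df$Date[idx] + sequence(df$Delay)

# Flag each delay day
is_blacklisted <- new_date %in% blacklist_days
is_weekend     <- format(new_date, "%u") %in% c("6", "7")

# Count the flagged days per original row; rows without hits get 0
df$n_bl      <- tabulate(idx[is_blacklisted], nbins = nrow(df))
df$n_weekend <- tabulate(idx[is_weekend], nbins = nrow(df))
df
```

The expanded dates only exist as a temporary vector here, so the table itself keeps one row per ID.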

CodePudding user response:

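You can build a data frame of the duplicated rows separately and then combine it with the original. The loop below runs over the delay values (2 up to the largest delay, here 3) rather than over the rows, so it needs only a few iterations even for a large table. Note that the shifted dates overwrite the date column here instead of going into a new column.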

> dat <- data.frame(ID = LETTERS[1:3], col1 = 1:3 * 100,
+                   date = as.Date(c('2021-05-01', '2018-04-03', '2020-02-17')),
+                   delay = c(1, 3, 2))
> dat
  ID col1       date delay
1  A  100 2021-05-01     1
2  B  200 2018-04-03     3
3  C  300 2020-02-17     2
> dat$sk <- 1:nrow(dat)
> ddup <- data.frame()
> for (i in 2:3) {
+     dd <- dat[dat$delay >= i, ]
+     dd$date <- dd$date + (i - 1)  # the i-th copy is shifted by i - 1 days
+     ddup <- rbind(ddup, dd)
+ }
> dat <- rbind(dat, ddup)
> dat <- dat[order(dat$sk, dat$date), ]
> dat
   ID col1       date delay sk
1   A  100 2021-05-01     1  1
2   B  200 2018-04-03     3  2
22  B  200 2018-04-04     3  2
21  B  200 2018-04-05     3  2
3   C  300 2020-02-17     2  3
31  C  300 2020-02-18     2  3
>
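
The loop bound 2:3 above is tied to this example. A hedged sketch of the same idea with the bound derived from the data might look like this; expand_by_delay is a hypothetical helper name, and it assumes delay contains whole numbers of at least 1 and that the table is not empty:

```r
dat <- data.frame(ID = LETTERS[1:3], col1 = 1:3 * 100,
                  date = as.Date(c("2021-05-01", "2018-04-03", "2020-02-17")),
                  delay = c(1, 3, 2))
dat$sk <- seq_len(nrow(dat))  # sort key to restore the original row order

# Hypothetical helper: one loop iteration per extra day, not per row
expand_by_delay <- function(dat) {
  pieces <- list(dat)  # the unshifted original rows
  for (i in seq_len(max(dat$delay) - 1)) {
    dd <- dat[dat$delay > i, ]   # rows that still need another copy
    dd$date <- dd$date + i       # this copy is shifted by i days
    pieces[[i + 1]] <- dd
  }
  out <- do.call(rbind, pieces)  # bind once instead of growing in the loop
  out <- out[order(out$sk, out$date), ]
  rownames(out) <- NULL
  out
}

out <- expand_by_delay(dat)
out
```

Collecting the pieces in a list and calling rbind once avoids repeatedly copying a growing data frame inside the loop.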