Home > other >  Expand number of rows with dates between startday and endday with new Date column
Expand number of rows with dates between startday and endday with new Date column

Time:03-04

I would like to expand the rows (> 10 million rows) in my dataframe based on the dates between the startdate and enddate columns

ID | Color | StartDate | EndDate    | Days 
1    blue    2022/01/01  2022/01/04   4
2    red     2022/01/01  2022/01/02   2

should result in:

ID | Color | StartDate | EndDate    | Days | New_Date
1    blue    2022/01/01  2022/01/04   4      2022/01/01
1    blue    2022/01/01  2022/01/04   4      2022/01/02
1    blue    2022/01/01  2022/01/04   4      2022/01/03
1    blue    2022/01/01  2022/01/04   4      2022/01/04
2    red     2022/01/01  2022/01/02   2      2022/01/01
2    red     2022/01/01  2022/01/02   2      2022/01/02

I thought about mutating rowwise with tidyr but couldn't get any satisfying results.

CodePudding user response:

This is a job for data.table, especially with 10 million rows.

df[rep(ID,Days)][, NewDate:=df[, seq(StartDate,EndDate,1), by=ID]$V1]

Output:

      ID  Color  StartDate    EndDate  Days    NewDate
   <num> <char>     <Date>     <Date> <num>     <Date>
1:     1   blue 2022-01-01 2022-01-04     4 2022-01-01
2:     1   blue 2022-01-01 2022-01-04     4 2022-01-02
3:     1   blue 2022-01-01 2022-01-04     4 2022-01-03
4:     1   blue 2022-01-01 2022-01-04     4 2022-01-04
5:     2    red 2022-01-01 2022-01-02     2 2022-01-01
6:     2    red 2022-01-01 2022-01-02     2 2022-01-02

Input:

df <- data.table(
  ID = c(1,2),
  Color = c("blue", "red"),
  StartDate = rep(as.Date("2022/01/01"),2),
  EndDate = c(as.Date("2022/01/04"), as.Date("2022/01/02")),
  Days = c(4,2)
)

CodePudding user response:

Assume you want to "expand" by ID, you can first change your "Date" columns into class Date, then use seq() to generate a sequence of dates. Since This method cannot append rows to your dataframe, therefore it's stored in a list. Finally unnest the list into rows.

library(tidyverse)

df %>% 
  group_by(ID) %>% 
  mutate(across(ends_with("Date"), ~as.Date(.)),
         newDate = list(seq(StartDate, EndDate, by = "day"))) %>% 
  unnest(cols = newDate)

# A tibble: 6 x 6
# Groups:   ID [2]
     ID Color StartDate  EndDate     Days newDate   
  <int> <chr> <date>     <date>     <int> <date>    
1     1 blue  2022-01-01 2022-01-04     4 2022-01-01
2     1 blue  2022-01-01 2022-01-04     4 2022-01-02
3     1 blue  2022-01-01 2022-01-04     4 2022-01-03
4     1 blue  2022-01-01 2022-01-04     4 2022-01-04
5     2 red   2022-01-01 2022-01-02     2 2022-01-01
6     2 red   2022-01-01 2022-01-02     2 2022-01-02
  • Related