Using R, how to assign a category into a new column based on a date in another column and carry that-CodePudding

I have a dataframe with approximately 3 million rows. Each row is assigned a unique ID and has up to 4 dates. I wish to create a set of new columns for month and year (i.e. Jan-21, Feb-21, Mar-21, etc) and assign a value of "0" for each month/year prior to the first date, and then a value of "1" for the month/year containing the date for each ID, and maintain the value of "1" in each subsequent month/year column until the next column that matches the 2nd date.

I understand that it's easier to help me with examples, so I have put together this dput output with an example of what my current data looks like:

structure(list(id = c(1, 2, 3, 4, 5), date1 = structure(c(1623801600, 
1615420800, 1654560000, 1620259200, 1615248000), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), date2 = structure(c(1629158400, 1621987200, 
1658448000, 1623974400, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    date3 = structure(c(NA, 1630454400, 1662076800, 1647907200, 
    NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"), date4 = structure(c(NA, 
    1639008000, NA, NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -5L))

And this is what I would like it to look like:

structure(list(id = c(1, 2, 3, 4, 5), `Mar-21` = c(0, 1, 0, 0, 
1), `Apr-21` = c(0, 1, 0, 0, 1), `May-21` = c(0, 2, 0, 1, 1), 
    `Jun-21` = c(1, 2, 0, 2, 1), `Jul-21` = c(1, 2, 0, 2, 1), 
    `Aug-21` = c(2, 2, 0, 2, 1), `Sep-21` = c(2, 3, 0, 2, 1), 
    `Oct-21` = c(2, 3, 0, 2, 1), `Nov-21` = c(2, 3, 0, 2, 1), 
    `Dec-21` = c(2, 4, 0, 2, 1), `Jan-22` = c(2, 4, 0, 2, 1), 
    `Feb-22` = c(2, 4, 0, 2, 1), `Mar-22` = c(2, 4, 0, 3, 1), 
    `Apr-22` = c(2, 4, 0, 3, 1), `May-22` = c(2, 4, 0, 3, 1), 
    `Jun-22` = c(2, 4, 1, 3, 1), `Jul-22` = c(2, 4, 2, 3, 1), 
    `Aug-22` = c(2, 4, 2, 3, 1), `Sep-22` = c(2, 4, 3, 3, 1)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -5L))

Just a note that I have this dataset in both wide and long format, in case using it in a long format makes more sense.

Thank you!

CodePudding user response：

This was a fun exercise! I'm sure there are a billion ways to do this more efficiently, but I think this works and was a fun puzzle for me. I first put the dates into long format to get a min and max. Then I made a sequence of those dates by month. I then used expand grid to make all combinations of the months with each ID to join it to the original data frame. Then I just summed how many dates1:4 were greater then the months in the list. I had to use floor_date to change dates1:4 to the first of the month. Hopefully this helps!

library(dplyr)
library(lubridate)
library(tidyr)
dat2<-dat%>%
  tidyr::pivot_longer(cols = -id, values_drop_na = T)

dat_min_max<-data.frame("Min" = min(dat2$value), "Max" = max(dat2$value))
month_seq<-seq(dat_min_max$Min, dat_min_max$Max months(1), by = "month")

dat3<-dat%>%
  mutate(date1 = floor_date(date1, "month"),
         date2 = floor_date(date2, "month"),
         date3 = floor_date(date3, "month"),
         date4 = floor_date(date4, "month")
         )%>%
  left_join(expand.grid(dat$id, month_seq), by = c("id" = "Var1"))%>%
  rowwise()%>%
  mutate(c = sum(date1 <= Var2, date2 <= Var2, date3 <= Var2, date4 <= Var2, na.rm = T))%>%
  mutate(Var2 = format(Var2, "%b-%y"))%>%
  select(-date1, -date2, -date3, -date4)%>%
  tidyr::pivot_wider(names_from = Var2, values_from = c)