I have a dataframe with approximately 3 million rows. Each row is assigned a unique ID and has up to 4 dates. I wish to create a set of new columns for month and year (i.e. Jan-21, Feb-21, Mar-21, etc) and assign a value of "0" for each month/year prior to the first date, and then a value of "1" for the month/year containing the date for each ID, and maintain the value of "1" in each subsequent month/year column until the next column that matches the 2nd date.
I understand that it's easier to help me with examples, so I have put together this dput output with an example of what my current data looks like:
structure(list(id = c(1, 2, 3, 4, 5), date1 = structure(c(1623801600,
1615420800, 1654560000, 1620259200, 1615248000), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), date2 = structure(c(1629158400, 1621987200,
1658448000, 1623974400, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
date3 = structure(c(NA, 1630454400, 1662076800, 1647907200,
NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"), date4 = structure(c(NA,
1639008000, NA, NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -5L))
And this is what I would like it to look like:
structure(list(id = c(1, 2, 3, 4, 5), `Mar-21` = c(0, 1, 0, 0,
1), `Apr-21` = c(0, 1, 0, 0, 1), `May-21` = c(0, 2, 0, 1, 1),
`Jun-21` = c(1, 2, 0, 2, 1), `Jul-21` = c(1, 2, 0, 2, 1),
`Aug-21` = c(2, 2, 0, 2, 1), `Sep-21` = c(2, 3, 0, 2, 1),
`Oct-21` = c(2, 3, 0, 2, 1), `Nov-21` = c(2, 3, 0, 2, 1),
`Dec-21` = c(2, 4, 0, 2, 1), `Jan-22` = c(2, 4, 0, 2, 1),
`Feb-22` = c(2, 4, 0, 2, 1), `Mar-22` = c(2, 4, 0, 3, 1),
`Apr-22` = c(2, 4, 0, 3, 1), `May-22` = c(2, 4, 0, 3, 1),
`Jun-22` = c(2, 4, 1, 3, 1), `Jul-22` = c(2, 4, 2, 3, 1),
`Aug-22` = c(2, 4, 2, 3, 1), `Sep-22` = c(2, 4, 3, 3, 1)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -5L))
Just a note that I have this dataset in both wide and long format, in case using it in a long format makes more sense.
Thank you!
CodePudding user response:
This was a fun exercise! I'm sure there are a billion ways to do this more efficiently, but I think this works and was a fun puzzle for me. I first put the dates into long format to get a min and max. Then I made a sequence of those dates by month. I then used expand grid to make all combinations of the months with each ID to join it to the original data frame. Then I just summed how many dates1:4 were greater then the months in the list. I had to use floor_date to change dates1:4 to the first of the month. Hopefully this helps!
library(dplyr)
library(lubridate)
library(tidyr)
dat2<-dat%>%
tidyr::pivot_longer(cols = -id, values_drop_na = T)
dat_min_max<-data.frame("Min" = min(dat2$value), "Max" = max(dat2$value))
month_seq<-seq(dat_min_max$Min, dat_min_max$Max months(1), by = "month")
dat3<-dat%>%
mutate(date1 = floor_date(date1, "month"),
date2 = floor_date(date2, "month"),
date3 = floor_date(date3, "month"),
date4 = floor_date(date4, "month")
)%>%
left_join(expand.grid(dat$id, month_seq), by = c("id" = "Var1"))%>%
rowwise()%>%
mutate(c = sum(date1 <= Var2, date2 <= Var2, date3 <= Var2, date4 <= Var2, na.rm = T))%>%
mutate(Var2 = format(Var2, "%b-%y"))%>%
select(-date1, -date2, -date3, -date4)%>%
tidyr::pivot_wider(names_from = Var2, values_from = c)