How to add missing month in a dataframe?

Time:07-14

How can I add the missing months to a date column in a data frame?

   starttime
1 2016-02-26
2 2016-04-12
3 2016-04-22
4 2016-08-04
5 2016-09-15
6 2016-09-16
7 2016-09-20
8 2016-09-22
9 2017-06-02

I would like to transform this into 2016-02-26 -> 2016-03-01 -> 2016-04-12 -> 2016-04-22 -> 2016-05-01 and so on, with an NA value for each of the inserted months.

CodePudding user response:

This is surprisingly challenging. What you want is to add the missing months to your vector. To do that, compute all months between the first and last dates, then check which ones your data already contains.

library(lubridate)
mydates <- as_date(c("2001-05-04", "2001-05-30", "2001-07-15", "2001-10-20"))
n <- length(mydates)

# Year-month key of every observed date
mymonths <- month(mydates)
myyears <- year(mydates)
myyearmonths <- paste(myyears, mymonths, sep='-')
df <- data.frame(mydates, myyears, mymonths)

# First day of the first and last observed months
firstdate <- floor_date(min(mydates), unit="month")
lastdate <- floor_date(max(mydates), unit = "month")
# Approximate number of months between them (days / average month length)
nbmonths <- as.numeric(round((lastdate - firstdate)/(365.25/12)))

# One date per month over the whole range, with matching year-month keys
fulldates <- firstdate %m+% months(0:nbmonths)
fullmonths <- month(fulldates)
fullyears <- year(fulldates)
fullyearmonths <- paste(fullyears, fullmonths, sep='-')

# Months without any observation, as first-of-month dates
toadd <- as_date(ym(fullyearmonths[!fullyearmonths %in% myyearmonths]))
result <- c(mydates, toadd)
result <- result[order(result)]
result
# [1] "2001-05-04" "2001-05-30" "2001-06-01" "2001-07-15" "2001-08-01" "2001-09-01" "2001-10-20"
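
The month count in the code above is estimated from a difference in days. As a possible alternative, here is a minimal base R sketch of the same idea that lets seq() step through calendar months directly. It assumes the same four example dates; the name observed is only illustrative and not part of the answer.

```r
# Same example dates as above, as base R Date objects
mydates <- as.Date(c("2001-05-04", "2001-05-30", "2001-07-15", "2001-10-20"))

# First day of the first and last observed months
firstdate <- as.Date(format(min(mydates), "%Y-%m-01"))
lastdate  <- as.Date(format(max(mydates), "%Y-%m-01"))

# One date per calendar month in the range, no day arithmetic needed
fulldates <- seq(firstdate, lastdate, by = "month")

# Keep only the months that have no observation (illustrative name: observed)
observed <- format(mydates, "%Y-%m")
toadd <- fulldates[!format(fulldates, "%Y-%m") %in% observed]

# Combine with the original dates and sort
result <- sort(c(mydates, toadd))
print(result)
```

Because the sequence always starts on the first of a month, stepping by "month" should not run into end-of-month rollover.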

CodePudding user response:

You may merge with an auxiliary data frame containing the missing dates. "Date" format is required. In a function f we build a monthly sequence over the range of the dates, compare year-month substrings (i.e. the dates with the days cut away) to find the months that are missing, concatenate those with the original dates, and sort the result.

f <- \(x) {
  # Monthly sequence from the first to the last observed month
  sq <- do.call(seq.Date, c(as.list(as.Date(paste0(substr(range(as.Date(x)), 1, 7), '-01'))), 'month'))
  # Keep only months NOT present in x, then add the original dates and sort
  sort(c(as.Date(x), sq[!substr(sq, 1, 7) %in% substr(x, 1, 7)]))
}

merge(transform(df, starttime=as.Date(starttime)), data.frame(starttime=f(df$starttime)), all=TRUE)
#     starttime  X
# 1  2016-02-26  0
# 2  2016-03-01 NA
# 3  2016-04-12  0
# 4  2016-04-22  0
# 5  2016-05-01 NA
# 6  2016-06-01 NA
# 7  2016-07-01 NA
# 8  2016-08-04  0
# 9  2016-09-15  0
# 10 2016-09-16  0
# 11 2016-09-20  0
# 12 2016-09-22  0
# 13 2016-10-01 NA
# 14 2016-11-01 NA
# 15 2016-12-01 NA
# 16 2017-01-01 NA
# 17 2017-02-01 NA
# 18 2017-03-01 NA
# 19 2017-04-01 NA
# 20 2017-05-01 NA
# 21 2017-06-02  0

Data:

x <- c('2016-02-26', '2016-04-12', '2016-04-22', '2016-08-04', '2016-09-15', '2016-09-16', '2016-09-20', '2016-09-22', '2017-06-02')
df <- data.frame(starttime=x, X=0)

CodePudding user response:

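We could separate the dates into day, month and year and use complete. Note that this fills in every month of each calendar year present in the data, so it also adds months before the first and after the last observation: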

library(tidyverse)
library(lubridate)  # as_date(), day(), month(), year(); not attached by tidyverse before 2.0.0

df <- tibble(starttime = as_date(c("2016-02-26",
                                   "2016-04-12",
                                   "2016-04-22",
                                   "2016-08-04",
                                   "2016-09-15",
                                   "2016-09-16",
                                   "2016-09-20",
                                   "2016-09-22",
                                   "2017-06-02")))
                                 
df |>                                 
  mutate(temp_day = day(starttime),
         temp_month = month(starttime),
         temp_year = year(starttime)) |>
  complete(temp_month = 1:12,
           temp_year = min(temp_year):max(temp_year)) |>
  mutate(temp_day = ifelse(is.na(temp_day), 1, temp_day),
         starttime = as_date(paste(temp_year, temp_month, temp_day, sep = '-'))) |>
  select(-starts_with("temp")) |>
  arrange(starttime)

Output:

# A tibble: 28 × 1
   starttime 
   <date>    
 1 2016-01-01
 2 2016-02-26
 3 2016-03-01
 4 2016-04-12
 5 2016-04-22
 6 2016-05-01
 7 2016-06-01
 8 2016-07-01
 9 2016-08-04
10 2016-09-15
11 2016-09-16
12 2016-09-20
13 2016-09-22
14 2016-10-01
15 2016-11-01
16 2016-12-01
17 2017-01-01
18 2017-02-01
19 2017-03-01
20 2017-04-01
21 2017-05-01
22 2017-06-02
23 2017-07-01
24 2017-08-01
25 2017-09-01
26 2017-10-01
27 2017-11-01
28 2017-12-01
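
The output above also contains months outside the observed range (2016-01 and 2017-07 to 2017-12). If only the gaps between the first and last observation should be filled, one possible variation is to complete a first-of-month column over a monthly sequence instead. This is a sketch under that assumption, not part of the answer; the names month_start and out are illustrative.

```r
library(dplyr)
library(tidyr)
library(lubridate)

df <- tibble(starttime = as_date(c("2016-02-26", "2016-04-12", "2016-04-22",
                                   "2016-08-04", "2016-09-15", "2016-09-16",
                                   "2016-09-20", "2016-09-22", "2017-06-02")))

out <- df |>
  # Helper column: first day of each observation's month
  mutate(month_start = floor_date(starttime, unit = "month")) |>
  # Add one row for every month between the first and last observed month
  complete(month_start = seq(min(month_start), max(month_start), by = "month")) |>
  # Rows added by complete() have no starttime; use the first of the month
  mutate(starttime = coalesce(starttime, month_start)) |>
  select(-month_start) |>
  arrange(starttime)

print(out, n = Inf)
```

Any other columns in df would be NA in the rows that complete() adds, which appears to be what the question asks for.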