How to increment dates of specific (dynamic) columns by one year?

I keep running into a "NAs introduced by coercion" warning.

I have multiple dataframes (close to 100) with several different 'Date' columns, including some called 'Period', which are formatted slightly differently.
The goal is to quickly iterate over the dataframes, identify any 'Period' or 'Date' columns, and increase the date value by 1 year.
Some of the values in the 'Period' or 'Date' columns may be blank/missing.
This needs to be dynamic in case another 'Date' column is added to one of the dataframes in the future.

Here is a simplified version of the problem I'm working on:

[[1]]
  grp Period Date_Begin   Date_End col4X col5Y
1   A                                  0     2
2   A                                  0     2
3   A 201901 2019-01-31 2019-03-31           2
4   A 201901 2019-01-13 2019-04-25   1.5     2
5   A 201902 2019-02-01 2019-03-01  1.75     2
6   A 201902 2019-02-01 2019-06-30     1     2
7   A 201903 2019-03-25 2019-07-25  <NA>     2

[[2]]
  grp Period Date_Start   Date_End Expected_Date col4X col5Y
1   A                                                0     2
2   A                                                0     2
3   A 201904 2019-04-31 2019-07-31    2019-02-31           2
4   A 201904 2019-01-13 2019-04-25    2019-06-25   1.5     2
5   A 201907 2019-02-01 2019-03-11    2019-03-06  1.75     2
6   A 201907 2019-02-01 2019-06-25    2019-06-25     1     2
7   A 201908 2019-03-25 2019-07-20    2019-07-20  <NA>     2

and the code I've tried:

library(stringr)  # provides str_sub()

rep_fun <- function(df) {
  df[, grep("Period", names(df))] <- paste(as.numeric(substr(as.character("Period"), 1, 4)) + 1, 
                                           str_sub(as.character("Period"), -2, -1), sep="", collapse=NULL)
  df[, grep("Date", names(df))] <- paste(as.numeric(substr(as.character(df[, grep("Date", names(df))]), 1, 4)) + 1, 
                                         str_sub(as.character(df[, grep("Date", names(df))]), -6, -1), sep="", collapse=NULL)
  df
}

lapply(df_list, function(x) rep_fun(x))

What am I getting wrong? I've tried adding is.na criteria, and manually writing each column name (not dynamic).

Data

df1x <- data.frame(grp=c("A", "A", "A", "A", "A", "A", "A"), 
                   Period=c('', '', '201901', '201901', '201902', '201902', '201903'), 
                   Date_Begin=c('', '', '2019-01-31', '2019-01-13', '2019-02-01', '2019-02-01', '2019-03-25'), 
                   Date_End=c('', '', '2019-03-31', '2019-04-25', '2019-03-01', '2019-06-30', '2019-07-25'), 
                   col4X=c(0, 0, "", 1.5, 1.75, 1, NA), 
                   col5Y=c(2, 2, 2, 2, 2, 2, 2))

df2x <-  data.frame(grp=c("A", "A", "A", "A", "A", "A", "A"), 
                    Period =c('', '', '201904', '201904', '201907', '201907', '201908'), 
                    Date_Start=c('', '', '2019-04-31', '2019-01-13', '2019-02-01', '2019-02-01', '2019-03-25'), 
                    Date_End=c('', '', '2019-07-31', '2019-04-25', '2019-03-11', '2019-06-25', '2019-07-20'), 
                    Expected_Date=c('', '', '2019-02-31', '2019-06-25', '2019-03-06', '2019-06-25', '2019-07-20'), 
                    col4X=c(0, 0, "", 1.5, 1.75, 1, NA), 
                    col5Y=c(2, 2, 2, 2, 2, 2, 2))

df_list <- list(df1x, df2x)

CodePudding user response：

In my opinion, the best way to deal with dates in weird formats is to convert them to dates with as.Date, do whatever operation you need to do, then convert it back to the original format using format. You can build a base R function with this principle, but my answer uses dplyr.

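library(dplyr)

rep_fun <- function(df){
  # "Date" columns: parse as Date, then add 365 days
  # "Period" columns: append day "01", parse, add 365 days, format back to yyyymm
  mutate(df, across(matches("Date"), ~ as.Date(.) + 365),
         across(matches("Period"), ~ (as.Date(paste0(., "01"), "%Y%m%d") + 365) %>% format("%Y%m")))
}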

We mutate every column that has "Date" in its name, converting it to a date and adding 365 days. Then we mutate the "Period" columns: append a day ("01"), convert to a date (specifying the format "%Y%m%d"), add 365 days, and format the result back to the original format.

You could also use the lubridate package and change 365 to years(1), which increases the year by one, regardless of how many days go by.
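A minimal sketch of that lubridate variant, assuming the package is installed; the example dates are made up for illustration and are not from the question. Plain + years(1) returns NA when the target day does not exist (29 February), while %m+% rolls back to the last valid day of the month:

```r
library(lubridate)

# Illustrative dates only, not taken from the question's data
d <- as.Date(c("2019-03-31", "2019-01-13", "2020-02-29"))

# Fixed day count: drifts by one day when the span crosses 29 February
plus_days <- d + 365

# Calendar year: same month and day, NA if that day does not exist
plus_year <- d + years(1)

# Calendar year with rollback to the last valid day of the month
plus_year_safe <- d %m+% years(1)

print(data.frame(d, plus_days, plus_year, plus_year_safe))
```

Which of the two calendar-year variants is appropriate depends on whether a silent rollback or an explicit NA is preferable for your data.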

There were some weird dates in your example, like '2019-02-31', which doesn't exist. Was that intended? Does the code need to handle that? Because mine does not. I changed those to dates that exist in order to run the example.
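If it helps, here is a rough base R sketch for spotting such non-existent dates before processing. The helper name is_invalid_date is hypothetical, and it assumes the "%Y-%m-%d" layout used by the 'Date' columns in the question:

```r
# Hypothetical helper: TRUE where a non-blank string is not a real calendar date.
# Assumes the yyyy-mm-dd layout unless another format is supplied.
is_invalid_date <- function(x, format = "%Y-%m-%d") {
  blank  <- is.na(x) | x == ""            # blanks and NAs are "missing", not "invalid"
  parsed <- as.Date(x, format = format)   # impossible dates parse to NA
  !blank & is.na(parsed)
}

dates <- c("", "2019-04-31", "2019-01-13", "2019-02-31", NA)
invalid <- is_invalid_date(dates)
print(dates[invalid])
```

Running this over each 'Date' column would list the values that need fixing by hand.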

Result:

> lapply(df_list, function(x) rep_fun(x))
[[1]]
  grp Period Date_Begin   Date_End col4X col5Y
1   A   <NA>       <NA>       <NA>     0     2
2   A   <NA>       <NA>       <NA>     0     2
3   A 202001 2020-01-31 2020-03-30           2
4   A 202001 2020-01-13 2020-04-24   1.5     2
5   A 202002 2020-02-01 2020-02-29  1.75     2
6   A 202002 2020-02-01 2020-06-29     1     2
7   A 202003 2020-03-24 2020-07-24  <NA>     2

[[2]]
  grp Period Date_Start   Date_End Expected_Date col4X col5Y
1   A   <NA>       <NA>       <NA>          <NA>     0     2
2   A   <NA>       <NA>       <NA>          <NA>     0     2
3   A 202004 2020-04-29 2020-07-30    2020-02-28           2
4   A 202004 2020-01-13 2020-04-24    2020-06-24   1.5     2
5   A 202007 2020-02-01 2020-03-10    2020-03-05  1.75     2
6   A 202007 2020-02-01 2020-06-24    2020-06-24     1     2
7   A 202008 2020-03-24 2020-07-19    2020-07-19  <NA>     2
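As for the "NAs introduced by coercion" warning in the question, a small sketch of where it most likely comes from. The tiny data frame here is made up for illustration:

```r
# as.character("Period") is just the literal string "Period", not the column,
# so the "year" extracted from it is "Peri"
year_part <- substr(as.character("Period"), 1, 4)
print(year_part)

# Converting that to a number gives NA with the coercion warning
msg <- tryCatch(as.numeric(year_part), warning = function(w) conditionMessage(w))
print(msg)

# Passing several columns at once to as.character() turns each whole column
# into a single deparsed string, so its first four characters are not a year
df <- data.frame(Date_Begin = c("", "2019-01-31"),
                 Date_End   = c("", "2019-03-31"))
first_four <- substr(as.character(df), 1, 4)
print(first_four)
year_num <- suppressWarnings(as.numeric(first_four))
print(year_num)
```

Working on one column at a time, as rep_fun above does through across(), avoids both problems.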

CodePudding user response：

Define a function incr1 that adds exactly one calendar year using seq.Date, and wrap it in rep_fun. This automatically accounts for leap years and, since it works on calendar dates rather than day counts, is unaffected by daylight saving time. The function handles the two formats separately: six-digit 'Period' values and yyyy-mm-dd 'Date' values.
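To see the seq.Date step in isolation, here is a small sketch. add_one_year is a hypothetical helper name and the dates are illustrative. seq() takes a single starting date, which is why a loop over the non-missing entries is used:

```r
# Hypothetical helper: shift each non-missing Date by one calendar year
add_one_year <- function(dates) {
  for (i in which(!is.na(dates))) {
    # seq() returns the start date and the date one year later; keep the latter
    dates[i] <- seq(dates[i], by = "year", length.out = 2)[2]
  }
  dates
}

d <- as.Date(c("2019-03-31", NA, "2019-01-13"))
shifted <- add_one_year(d)
print(shifted)

# For comparison: a fixed 365-day offset lands one day early here,
# because the span crosses 29 February 2020
print(d + 365)
```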

Invalid dates in your data, such as 2019-04-31, will be turned into NA (with a warning), so fix those first. Columns with an unexpected format are also returned as NA with a warning. I think that's a good thing, because handling dates correctly can matter a lot.

rep_fun <- \(df) {
  incr1 <- \(x) {
    x[x %in% ''] <- NA
    na <- is.na(x)
    lna <- length(na[na])
    nna <- which(!na)
    if (all(grepl('^\\d{6}$', na.omit(x)))) {
      x[nna] <- paste0(x[nna], '01')
      x <- as.Date(x, '%Y%m%d')
      for (i in nna) {
        x[i] <- seq.Date(x[i], by='year', length.out=2)[[2]]
      }
      out <- format(x, '%Y%m')
    }
    else if (all(grepl('^\\d{4}-\\d{2}-\\d{2}$', na.omit(x)))) {
      x <- as.Date(x, '%Y-%m-%d')
      nna <- which(!is.na(x))
      for (i in nna) {
        x[i] <- seq.Date(x[i], by='year', length.out=2)[[2]]
      }
      out <- x
    } else {
      warning('Unexpected column format, NAs created.')
      out <- NA_character_
    }
    if (sum(is.na(out)) > lna) warning('Invalid dates detected, NAs created.')  ## more NAs than blanks going in
    out
  }
  dc <- grep('Date|Period', names(df))  ## date columns
  df[dc] <- lapply(df[dc], incr1)
  df
}

Gives

lapply(df_list, rep_fun)
# [[1]]
#   grp Period Date_Begin   Date_End col4X col5Y
# 1   A   <NA>       <NA>       <NA>     0     2
# 2   A   <NA>       <NA>       <NA>     0     2
# 3   A 202001 2020-01-31 2020-03-31           2
# 4   A 202001 2020-01-13 2020-04-25   1.5     2
# 5   A 202002 2020-02-01 2020-03-01  1.75     2
# 6   A 202002 2020-02-01 2020-06-30     1     2
# 7   A 202003 2020-03-25 2020-07-25  <NA>     2
# 
# [[2]]
#   grp Period Date_Start   Date_End Expected_Date col4X col5Y
# 1   A   <NA>       <NA>       <NA>          <NA>     0     2
# 2   A   <NA>       <NA>       <NA>          <NA>     0     2
# 3   A 202004       <NA> 2020-07-31          <NA>           2
# 4   A 202004 2020-01-13 2020-04-25    2020-06-25   1.5     2
# 5   A 202007 2020-02-01 2020-03-11    2020-03-06  1.75     2
# 6   A 202007 2020-02-01 2020-06-25    2020-06-25     1     2
# 7   A 202008 2020-03-25 2020-07-20    2020-07-20  <NA>     2

# Warning messages:
# 1: In FUN(X[[i]], ...) : Invalid dates detected, NAs created.
# 2: In FUN(X[[i]], ...) : Invalid dates detected, NAs created.

Note: R > 4.2 was used; the \(x) function shorthand requires R >= 4.1.

Data:

df_list <- list(structure(list(grp = c("A", "A", "A", "A", "A", "A", "A"
), Period = c("", "", "201901", "201901", "201902", "201902", 
"201903"), Date_Begin = c("", "", "2019-01-31", "2019-01-13", 
"2019-02-01", "2019-02-01", "2019-03-25"), Date_End = c("", "", 
"2019-03-31", "2019-04-25", "2019-03-01", "2019-06-30", "2019-07-25"
), col4X = c("0", "0", "", "1.5", "1.75", "1", NA), col5Y = c(2, 
2, 2, 2, 2, 2, 2)), class = "data.frame", row.names = c(NA, -7L
)), structure(list(grp = c("A", "A", "A", "A", "A", "A", "A"), 
    Period = c("", "", "201904", "201904", "201907", "201907", 
    "201908"), Date_Start = c("", "", "2019-04-31", "2019-01-13", 
    "2019-02-01", "2019-02-01", "2019-03-25"), Date_End = c("", 
    "", "2019-07-31", "2019-04-25", "2019-03-11", "2019-06-25", 
    "2019-07-20"), Expected_Date = c("", "", "2019-02-31", "2019-06-25", 
    "2019-03-06", "2019-06-25", "2019-07-20"), col4X = c("0", 
    "0", "", "1.5", "1.75", "1", NA), col5Y = c(2, 2, 2, 2, 2, 
    2, 2)), class = "data.frame", row.names = c(NA, -7L)))