I keep running into a "NAs introduced by coercion"
error.
- I have multiple dataframes (close to 100) with several different 'Date' columns, including some called 'Period'... these are formatted slightly differently
- The goal is to quickly iterate over the dataframes, identify any 'Period' or 'Date' columns and increase the date value by 1 year.
- Some of the values in 'Period' or 'Date' columns may be blank/missing
- This needs to be dynamic in the event another 'Date' column is added to one of the dataframes in the future
Here a simplified version of the problem I'm working on,
[[1]]
grp Period Date_Begin Date_End col4X col5Y
1 A 0 2
2 A 0 2
3 A 201901 2019-01-31 2019-03-31 2
4 A 201901 2019-01-13 2019-04-25 1.5 2
5 A 201902 2019-02-01 2019-03-01 1.75 2
6 A 201902 2019-02-01 2019-06-30 1 2
7 A 201903 2019-03-25 2019-07-25 <NA> 2
[[2]]
grp Period Date_Start Date_End Expected_Date col4X col5Y
1 A 0 2
2 A 0 2
3 A 201904 2019-04-31 2019-07-31 2019-02-31 2
4 A 201904 2019-01-13 2019-04-25 2019-06-25 1.5 2
5 A 201907 2019-02-01 2019-03-11 2019-03-06 1.75 2
6 A 201907 2019-02-01 2019-06-25 2019-06-25 1 2
7 A 201908 2019-03-25 2019-07-20 2019-07-20 <NA> 2
and the code I've tried:
rep_fun <- function(df) {
df[, grep("Period", names(df))] <- paste(as.numeric(substr(as.character("Period"), 1, 4)) 1,
str_sub(as.character("Period"), - 2, - 1), sep="", collapse=NULL)
df[, grep("Date", names(df))] <- paste(as.numeric(substr(as.character(df[, grep("Date", names(df))]), 1, 4)) 1,
str_sub(as.character(df[, grep("Date", names(df))]), - 6, - 1), sep="", collapse=NULL)
df
}
lapply(df_list, function(x) rep_fun(x))
What am I getting wrong? I've tried adding is.na
criteria, and manually writing each column name (not dynamic).
Data
df1x <- data.frame(grp=c("A", "A", "A", "A", "A", "A", "A"),
Period=c('', '', '201901', '201901', '201902', '201902', '201903'),
Date_Begin=c('', '', '2019-01-31', '2019-01-13', '2019-02-01', '2019-02-01', '2019-03-25'),
Date_End=c('', '', '2019-03-31', '2019-04-25', '2019-03-01', '2019-06-30', '2019-07-25'),
col4X=c(0, 0, "", 1.5, 1.75, 1, NA),
col5Y=c(2, 2, 2, 2, 2, 2, 2))
df2x <- data.frame(grp=c("A", "A", "A", "A", "A", "A", "A"),
Period =c('', '', '201904', '201904', '201907', '201907', '201908'),
Date_Start=c('', '', '2019-04-31', '2019-01-13', '2019-02-01', '2019-02-01', '2019-03-25'),
Date_End=c('', '', '2019-07-31', '2019-04-25', '2019-03-11', '2019-06-25', '2019-07-20'),
Expected_Date=c('', '', '2019-02-31', '2019-06-25', '2019-03-06', '2019-06-25', '2019-07-20'),
col4X=c(0, 0, "", 1.5, 1.75, 1, NA),
col5Y=c(2, 2, 2, 2, 2, 2, 2))
df_list <- list(df1x, df2x)
CodePudding user response:
In my opinion, the best way to deal with dates in weird formats is to convert them to dates with as.Date
, do whatever operation you need to do, then convert it back to the original format using format
. You can build a base R function with this principle, but my answer uses dplyr
.
rep_fun <- function(df){
mutate(df2x, across(matches("Date"), ~ as.Date(.) 365),
across(matches("Period"), ~ (as.Date(paste0(., "01"), "%Y%m%d") 365) %>% format("%Y%m")))
}
We are mutating every column that has "Date"
in the name, turning into date, and adding 365 days. Then we mutate the "Period"
columns, adding a day "01"
, turning it into date (specifying the format "%Y%m%d"
), adding 365 days, and formatting it back to the old format.
You could also use the lubridate package and change 365
to years(1)
, which increases the year by one, regardless of how many days go by.
There were some weird dates in your example, like '2019-02-31'
, which doesn't exist. Was that intended? Does the code needs to handle that? Because my does not. I changed those days to dates that exist to run the example.
Result:
> lapply(df_list, function(x) rep_fun(x))
[[1]]
grp Period Date_Begin Date_End col4X col5Y
1 A <NA> <NA> <NA> 0 2
2 A <NA> <NA> <NA> 0 2
3 A 202001 2020-01-31 2020-03-30 2
4 A 202001 2020-01-13 2020-04-24 1.5 2
5 A 202002 2020-02-01 2020-02-29 1.75 2
6 A 202002 2020-02-01 2020-06-29 1 2
7 A 202003 2020-03-24 2020-07-24 <NA> 2
[[2]]
grp Period Date_Start Date_End Expected_Date col4X col5Y
1 A <NA> <NA> <NA> <NA> 0 2
2 A <NA> <NA> <NA> <NA> 0 2
3 A 202004 2020-04-29 2020-07-30 2020-02-28 2
4 A 202004 2020-01-13 2020-04-24 2020-06-24 1.5 2
5 A 202007 2020-02-01 2020-03-10 2020-03-05 1.75 2
6 A 202007 2020-02-01 2020-06-24 2020-06-24 1 2
7 A 202008 2020-03-24 2020-07-19 2020-07-19 <NA> 2
CodePudding user response:
Define a function incr1
that adds exactly one year using seq.Date
and wrap it in rep_fun
. This will automatically take account for leap years and daylight saving time stuff. We make a case handling for format of the Period or Date columns.
Invalid dates in your data such as 2019-04-31
will get deleted (throws a warning
), fix this first. Also columns with unexpected will be returned NA
with a warning
. I think that's cool because correct handling of dates might be important in life.
rep_fun <- \(df) {
incr1 <- \(x) {
x[x %in% ''] <- NA
na <- is.na(x)
lna <- length(na[na])
nna <- which(!na)
if (all(grepl('^\\d{6}$', na.omit(x)))) {
x[nna] <- paste0(x[nna], '01')
x <- as.Date(x, '%Y%m%d')
for (i in nna) {
x[i] <- seq.Date(x[i], by='year', length.out=2)[[2]]
}
out <- format(x, '%Y%m')
}
else if (all(grepl('^\\d{4}-\\d{2}-\\d{2}$', na.omit(x)))) {
x <- as.Date(x, '%Y-%m-%d')
nna <- which(!is.na(x))
for (i in nna) {
x[i] <- seq.Date(x[i], by='year', length.out=2)[[2]]
}
out <- x
} else {
warning('Unexpected column format, NAs created.')
out <- NA_character_
}
if (length(is.na(out)) > lna) warning('Invalid dates detected, NAs created.')
out
}
dc <- grep('Date|Period', names(df)) ## date columns
df[dc] <- lapply(df[dc], incr1)
df
}
Gives
lapply(df_list, rep_fun)
# [[1]]
# grp Period Date_Begin Date_End col4X col5Y
# 1 A <NA> <NA> <NA> 0 2
# 2 A <NA> <NA> <NA> 0 2
# 3 A 202001 2020-01-31 2020-03-31 2
# 4 A 202001 2020-01-13 2020-04-25 1.5 2
# 5 A 202002 2020-02-01 2020-03-01 1.75 2
# 6 A 202002 2020-02-01 2020-06-30 1 2
# 7 A 202003 2020-03-25 2020-07-25 <NA> 2
#
# [[2]]
# grp Period Date_Start Date_End Expected_Date col4X col5Y
# 1 A <NA> <NA> <NA> <NA> 0 2
# 2 A <NA> <NA> <NA> <NA> 0 2
# 3 A 202004 <NA> 2020-07-31 <NA> 2
# 4 A 202004 2020-01-13 2020-04-25 2020-06-25 1.5 2
# 5 A 202007 2020-02-01 2020-03-11 2020-03-06 1.75 2
# 6 A 202007 2020-02-01 2020-06-25 2020-06-25 1 2
# 7 A 202008 2020-03-25 2020-07-20 2020-07-20 <NA> 2
# Warning messages:
# 1: In FUN(X[[i]], ...) : Invalid dates detected, NAs created.
# 2: In FUN(X[[i]], ...) : Invalid dates detected, NAs created.
# 3: In FUN(X[[i]], ...) : Invalid dates detected, NAs created.
# 4: In FUN(X[[i]], ...) : Invalid dates detected, NAs created.
# 5: In FUN(X[[i]], ...) : Invalid dates detected, NAs created.
# 6: In FUN(X[[i]], ...) : Invalid dates detected, NAs created.
# 7: In FUN(X[[i]], ...) : Invalid dates detected, NAs created.
Note R > 4.2 was used.
Data:
df_list <- list(structure(list(grp = c("A", "A", "A", "A", "A", "A", "A"
), Period = c("", "", "201901", "201901", "201902", "201902",
"201903"), Date_Begin = c("", "", "2019-01-31", "2019-01-13",
"2019-02-01", "2019-02-01", "2019-03-25"), Date_End = c("", "",
"2019-03-31", "2019-04-25", "2019-03-01", "2019-06-30", "2019-07-25"
), col4X = c("0", "0", "", "1.5", "1.75", "1", NA), col5Y = c(2,
2, 2, 2, 2, 2, 2)), class = "data.frame", row.names = c(NA, -7L
)), structure(list(grp = c("A", "A", "A", "A", "A", "A", "A"),
Period = c("", "", "201904", "201904", "201907", "201907",
"201908"), Date_Start = c("", "", "2019-04-31", "2019-01-13",
"2019-02-01", "2019-02-01", "2019-03-25"), Date_End = c("",
"", "2019-07-31", "2019-04-25", "2019-03-11", "2019-06-25",
"2019-07-20"), Expected_Date = c("", "", "2019-02-31", "2019-06-25",
"2019-03-06", "2019-06-25", "2019-07-20"), col4X = c("0",
"0", "", "1.5", "1.75", "1", NA), col5Y = c(2, 2, 2, 2, 2,
2, 2)), class = "data.frame", row.names = c(NA, -7L)))