Home > Enterprise >  Extract year, month and day when dates are non-standard format
Extract year, month and day when dates are non-standard format

Time:10-06

I have a column of dates, I want to extract the year, month and day into separate columns. Unfortunately there are inconsistent entries in the dates column, so the normal solution of using format(as.Date(),"%Y") or lubridate::year() doesn't work.

Here is an example dataframe:

dates_df <- data.frame(dates = c("1985-03-23", "", "1983", "1984-01"))

And here is the desired result:

       dates year month  day
1 1985-03-23 1985     3   23
2            <NA>  <NA> <NA>
3       1983 1983  <NA> <NA>
4    1984-01 1984     1 <NA>

I can achieve the desired result with the following code, but it is very slow on large datasets (>100,000 rows):

dates_df$year <- sapply(dates_df$dates, function(x) unlist(strsplit(x, "\\-"))[1])
dates_df$month <- sapply(dates_df$dates, function(x) unlist(strsplit(x, "\\-"))[2])
dates_df$day <- sapply(dates_df$dates, function(x) unlist(strsplit(x, "\\-"))[3])

My question:

Is there a more efficient (fast) way to extract year, month, day columns from messy date data?

CodePudding user response:

See if this is any faster. I can't test it right at this moment. And I'm not sure how it will handle the blank or non full date values.

df <- data.frame(dates,
             year = as.numeric(format(datetxt, format = "%Y")),
             month = as.numeric(format(datetxt, format = "%m")),
             day = as.numeric(format(datetxt, format = "%d")))

CodePudding user response:

Using strsplit and adapting the lengths.

cbind(dates_df, t(sapply(strsplit(dates_df$dates, '-'), `length<-`, 3)))
#        dates    1    2    3
# 1 1985-03-23 1985   03   23
# 2            <NA> <NA> <NA>
# 3       1983 1983 <NA> <NA>
# 4    1984-01 1984   01 <NA>

With nice names:

cbind(dates_df, `colnames<-`(
  t(sapply(strsplit(dates_df$dates, '-'), `length<-`, 3)), c('year', 'month', 'day')))
#        dates year month  day
# 1 1985-03-23 1985    03   23
# 2            <NA>  <NA> <NA>
# 3       1983 1983  <NA> <NA>
# 4    1984-01 1984    01 <NA>
  • Related