I have a column of dates, I want to extract the year, month and day into separate columns. Unfortunately there are inconsistent entries in the dates column, so the normal solution of using format(as.Date(),"%Y")
or lubridate::year()
doesn't work.
Here is an example dataframe:
dates_df <- data.frame(dates = c("1985-03-23", "", "1983", "1984-01"))
And here is the desired result:
dates year month day
1 1985-03-23 1985 3 23
2 <NA> <NA> <NA>
3 1983 1983 <NA> <NA>
4 1984-01 1984 1 <NA>
I can achieve the desired result with the following code, but it is very slow on large datasets (>100,000 rows):
dates_df$year <- sapply(dates_df$dates, function(x) unlist(strsplit(x, "\\-"))[1])
dates_df$month <- sapply(dates_df$dates, function(x) unlist(strsplit(x, "\\-"))[2])
dates_df$day <- sapply(dates_df$dates, function(x) unlist(strsplit(x, "\\-"))[3])
My question:
Is there a more efficient (fast) way to extract year, month, day columns from messy date data?
CodePudding user response:
See if this is any faster. I can't test it right at this moment. And I'm not sure how it will handle the blank or non full date values.
df <- data.frame(dates,
year = as.numeric(format(datetxt, format = "%Y")),
month = as.numeric(format(datetxt, format = "%m")),
day = as.numeric(format(datetxt, format = "%d")))
CodePudding user response:
Using strsplit
and adapting the length
s.
cbind(dates_df, t(sapply(strsplit(dates_df$dates, '-'), `length<-`, 3)))
# dates 1 2 3
# 1 1985-03-23 1985 03 23
# 2 <NA> <NA> <NA>
# 3 1983 1983 <NA> <NA>
# 4 1984-01 1984 01 <NA>
With nice names:
cbind(dates_df, `colnames<-`(
t(sapply(strsplit(dates_df$dates, '-'), `length<-`, 3)), c('year', 'month', 'day')))
# dates year month day
# 1 1985-03-23 1985 03 23
# 2 <NA> <NA> <NA>
# 3 1983 1983 <NA> <NA>
# 4 1984-01 1984 01 <NA>