In R, using lubridate
, I have dates for records of a species. Some of these dates are records that are specific to day, some are specific to month, and some are specific to year. Many of the dates are recorded in different formats. I have been using lubridate
to parse these date records.
I want to create a column for 'start_date' and a column for 'end_date', to represent the range of possible days (and therefore uncertainty) for any records that are not specific to a particular day.
My original thought was to use the floor_date()
function for the start_date column, and ceiling_date()
function for the end_date column. If a date was already specific to day, then both start and end date would end up the same. If the date was not specific to day, the start and end dates would represent the range.
The problem with this is that floor_date()
and ceiling_date()
only work with data that is already formatted into a date format - and if I use parse_date_time()
from lubridate
to format into a date format, the date is automatically rounded down. I have not found a way to manually adjust rounding in parse_date_time()
.
Is there any way for me to get around this?
Some example code to show my problem:
library(lubridate)
species <- c("mudkip", "treecko", "torchic", "swampert", "galvantula")
raw_date <- c("5 Jan 2013", "February 2018", "10/2001", "25/12/1984", "2020")
df <- data.frame(species, raw_date)
df$start_date <- floor_date(df$raw_date, unit="day")
df$end_date <- ceiling_date(df$raw_date, unit="day")
Error in UseMethod("reclass_date", orig) :
no applicable method for 'reclass_date' applied to an object of class "character"
In addition: Warning message:
All formats failed to parse. No formats found.
Attempt using parse_date_time for illustration:
species <- c("mudkip", "treecko", "torchic", "swampert", "galvantula")
raw_date <- c("5 Jan 2013", "February 2018", "10/2001", "25/12/1984", "2020")
df <- data.frame(species, raw_date)
df$processed_date <- parse_date_time(df$raw_date, orders = c("ymd","dmy", "my", "y"))
df$start_date <- floor_date(df$raw_date, unit="day")
df$end_date <- ceiling_date(df$raw_date, unit="day")
df$start_date[df$species=="torchic"]
#[1] "2001-10-01 UTC"
df$end_date[df$species=="torchic"]
#[1] "2001-10-01 UTC" - I want this to be "2001-10-31 UTC"
df$start_date[df$species=="galvantula"]
#[1] "2020-01-01 UTC"
df$end_date[df$species=="galvantula"]
#[1] "2020-01-01 UTC" - I want this to be "2020-12-31 UTC"
CodePudding user response:
UPDATE WITH SOLUTION:
I have solved this problem myself, using parse_date_time()
and the fact that, if a date is not in a pre-defined order, parse_date_time()
fails and leaves behind an NA
. I create a dataframe that I then use to figure out what format the date is in, and then create the end_date column based on the format.
I am aware that lubridate
also features the function guess_formats()
, but this is less helpful because the function outputs many possible formats.
Code below:
library(lubridate)
library(tidyverse)
#Create dataframe
species <- c("mudkip", "treecko", "torchic", "swampert", "galvantula", "volcarona", "joltik")
raw_date <- c("5 Jan 2013", "February 2018", "10/2001", "25/12/1984", "2020", "2012-03-12", "not a date")
# Added Volcarona to my species to have an example in which the year is first in the date format
# Added Joltik to represent a species in which the date is blank or otherwise unreadable
df <- data.frame(species, raw_date)
# Create a dataframe for formatting the dataset - you can just add these columns to the main dataframe, but I found this neater
date_formatting <- data.frame(matrix(ncol=5, nrow=length(df$raw_date)))
names(date_formatting) <- c("raw_date", "ymd","dmy", "my", "y")
date_formatting$raw_date <- df$raw_date
date_formatting$ymd <- parse_date_time(date_formatting$raw_date, orders = c("ymd"))
date_formatting$dmy <- parse_date_time(date_formatting$raw_date, orders = c("dmy"))
date_formatting$my <- parse_date_time(date_formatting$raw_date, orders = c("my"))
date_formatting$y <- parse_date_time(date_formatting$raw_date, orders = c("y"))
# You can use other date formats based on your needs.
# Find what the date format is based on what did and didn't worked
date_formatting <- date_formatting %>%
mutate(format = case_when(
!is.na(ymd) ~ "ymd",
!is.na(dmy) ~ "dmy",
!is.na(my) ~ "my",
!is.na(y) ~ "y",
# TRUE to account for dates that can't be read due to format - could leave blank.
TRUE ~ "unreadable"
))
df$format <- date_formatting$format
# Start date works as normal regardless of date format
df$start_date <- parse_date_time(df$raw_date, orders = c("ymd","dmy", "my", "y"))
# For end date, calculate values based on format
df <- df %>%
mutate(end_date = case_when(
format == "ymd" | format == "dmy" ~ parse_date_time(raw_date, orders = c("ymd","dmy")),
# When going to the end of the year/month, make sure to use %m % to account for varying lengths of months/years.
# Subtract 1 day to get the last day of that month/year (rather than the first day of the subsequent day/year)
format == "my" ~ (parse_date_time(raw_date, orders = c("my")) %m % months(1)) - days(1),
format == "y" ~ (parse_date_time(raw_date, orders = c("y")) %m % years(1)) - days(1),
TRUE ~ NA_real_))
I hope this helps any future people with similar issues! If anyone has a more elegant solution I'd love to know.