How to fix date typos in R?-CodePudding

I have a data set with a good hundred thousand lines in it. somehow.. the data provider sent it to me with all the dates formatted like 1/1/20202021 08:07:43 AM (mdy_hms). The correct year should be the last four in year for every row.

lubridate::mdy_hms() obviously cant recognize this. So I am trying to figure out how I could use grep or similar to pull out the correct date time. Any ideas?

Thanks everyone (:

CodePudding user response：

You can handle this with functions in the stringr package. First, get the correct year by extracting it from the date variable. For example,

library(stringr)

date_value <- "1/1/20202021 08:07:43 AM"

correct_year <- str_sub(
  str_extract(date_value, pattern = "\\d{8}\\s"), 5, 10
)

This returns "2021 ". You can now use str_replace() to replace the 8-digit bad year with correct_year:

str_replace(date_value, pattern = "\\d{8}\\s", replacement = correct_year)

[1] "1/1/2021 08:07:43 AM"

To perform this operation across the whole data frame you can do something like this:

library(tidyverse)

df %>% 
  mutate(
    date_value = str_replace(
      date_value, 
      pattern = "\\d{8}\\s", 
      replacement = str_sub(
        str_extract(date_value, pattern = "\\d{8}\\s"), 5, 10
      )
    )
  )

CodePudding user response：

You can extract only the 2nd 4-digit year with sub.

x <- "1/1/20202021 08:07:43 AM"
lubridate::mdy_hms(sub('(\\d{4})(\\d{4})', '\\2', x))
#[1] "2021-01-01 08:07:43 UTC"

To apply this to entire column you replace x with df$column_name.