I have a data set with a good hundred thousand lines in it.
Somehow, the data provider sent it to me with all the dates formatted like 1/1/20202021 08:07:43 AM (mdy_hms). The correct year is the last four digits of the year field in every row.
lubridate::mdy_hms() obviously can't parse this, so I am trying to figure out how I could use grep or similar to pull out the correct date-time. Any ideas?
Thanks everyone (:
CodePudding user response:
You can handle this with functions in the stringr package. First, get the correct year by extracting it from the date variable. For example,
library(stringr)
date_value <- "1/1/20202021 08:07:43 AM"
correct_year <- str_sub(
  str_extract(date_value, pattern = "\\d{8}\\s"), 5, 10
)
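This returns "2021 " (with a trailing space, because the pattern's \\s matches the space after the year). You can now use str_replace() to replace the 8-digit bad year, together with that space, with correct_year: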
str_replace(date_value, pattern = "\\d{8}\\s", replacement = correct_year)
[1] "1/1/2021 08:07:43 AM"
To perform this operation across the whole data frame you can do something like this:
library(tidyverse)
df %>%
  mutate(
    date_value = str_replace(
      date_value,
      pattern = "\\d{8}\\s",
      replacement = str_sub(
        str_extract(date_value, pattern = "\\d{8}\\s"), 5, 10
      )
    )
  )
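The extract-then-replace steps can probably be collapsed into a single call, since str_replace() accepts backreferences such as "\\1" in its replacement. The following is a minimal sketch, not a tested drop-in: the data frame df, its sample rows, and the column name date_value are assumptions made for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data; the real data frame and column name may differ.
df <- data.frame(
  date_value = c("1/1/20202021 08:07:43 AM", "12/31/20192020 11:59:59 PM"),
  stringsAsFactors = FALSE
)

df_fixed <- df %>%
  mutate(
    # Match eight consecutive digits, capture the last four,
    # and replace the whole match with the captured group.
    date_value = str_replace(date_value, "\\d{4}(\\d{4})", "\\1")
  )
```

Because the month and day never have more than two digits, the first run of eight digits in each string should be the doubled year.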
CodePudding user response:
You can keep only the second 4-digit year with sub.
x <- "1/1/20202021 08:07:43 AM"
lubridate::mdy_hms(sub('(\\d{4})(\\d{4})', '\\2', x))
#[1] "2021-01-01 08:07:43 UTC"
To apply this to an entire column, replace x with df$column_name.
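With around a hundred thousand rows, it may be worth confirming that every value really has the doubled year before rewriting the column. Here is a rough sketch of that check followed by the column-wide fix. The data frame df, the column names date_value and date_time, and the sample rows are placeholders, not names from the original data.

```r
library(lubridate)

# Hypothetical sample data standing in for the real column.
df <- data.frame(
  date_value = c("1/1/20202021 08:07:43 AM", "12/31/20192020 11:59:59 PM"),
  stringsAsFactors = FALSE
)

# Sanity check: every row should contain a slash, eight digits, then a space.
has_doubled_year <- grepl("/\\d{8} ", df$date_value)
stopifnot(all(has_doubled_year))

# Keep only the second four-digit group, then parse as month/day/year h:m:s.
df$date_time <- mdy_hms(sub("(\\d{4})(\\d{4})", "\\2", df$date_value))
```

If the check fails, which(!has_doubled_year) gives the row numbers that need a closer look.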