I would like to use some sort of regex function for the rest of the data (~2 million rows) to extract relevant date information (ideally have Day, Month, and time in date format since Year is only 2009 for ~2 million rows).
I have a column that looks like so:
ID | created_at
1 Mon Apr 06 22:19:45 PDT 2009
2 Mon Apr 06 22:19:49 PDT 2009
I applied these two functions to extract "day" and delete 'PDT 2009' from the end but now I would like to make the rest of the columns be date format for plotting purposes.
df$Day <- sub("([A-Za-z] ).*", "\\1", df$created_at) ## Extract first word
df$delete <- gsub("\\PDT.*","", df$created_at) ## Delete everything after PDT starts
Desired outcome:
ID | created_at | Month | Day | Time
1 Mon Apr 06 22:19:45 PDT 2009 Apr Mon 22:19:45
2 Mon Apr 06 22:19:49 PDT 2009 Apr Mon 22:19:49
CodePudding user response:
You do not need to use any regexes, just regular date formatting is sufficient. you can find a nice overview here or in ?strptime()
. You just have to adjust for the separators. This should be easier and more efficient than using regexes, splitting, etc...
Once you have this in native R DateTime formats POSIXlt
and POSIXt
, you can easily extract all date-related information.
strptime(x = "Mon Apr 06 22:19:45 PDT 2009",
format = "%a %b %d %H:%M:%S PDT %Y")
#> [1] "2009-04-06 22:19:45 CEST"
CodePudding user response:
Here is an approach using str_split
in companion with map_char
:
library(tidyverse)
df %>%
mutate(elements = str_split(created_at, fixed(" "), n=6)) %>%
mutate(Month = map_chr(elements, 2),
Day = map_chr(elements, 1),
Time = map_chr(elements, 4), .keep="unused"
)
output:
ID created_at Month Day Time
1 1 Mon Apr 06 22:19:45 PDT 2009 Apr Mon 22:19:45
2 2 Mon Apr 06 22:19:49 PDT 2009 Apr Mon 22:19:49
CodePudding user response:
You can use the following solution too:
library(dplyr)
df %>%
mutate(ID = row_number(),
Month = gsub("(?:[A-Za-z] )\\s([A-Za-z] ).*", "\\1", created_at, perl = TRUE),
Day = gsub("([A-Za-z] ).*", "\\1", created_at, perl = TRUE),
Time = gsub(".*(\\d{2}:\\d{2}:\\d{2}).*", "\\1", created_at, perl = TRUE)) %>%
relocate(ID)
# A tibble: 2 x 5
ID created_at Month Day Time
<int> <chr> <chr> <chr> <chr>
1 1 Mon Apr 06 22:19:45 PDT 2009 Apr Mon 22:19:45
2 2 Mon Apr 06 22:19:49 PDT 2009 Apr Mon 22:19:49
CodePudding user response:
If you're just after Month
, Day
, and Time
, why not use extract
from the tidyverse
:
library(tidyr)
df %>%
extract(col = created_at,
into = c('Month', 'Day', 'Time'),
regex = "([A-Za-z] )\\s([A-Za-z] )\\s\\d{2}\\s([\\d:] )")
Month Day Time
1 Mon Apr 22:19:45
2 Mon Apr 22:19:49
Here, we define three capturing groups using the round brackets syntax (...)
to identify the substrings we want to extract
into the three columns.
If you also need created_at
in its original form, just store the results as, say, df1
and use cbind
:
cbind(df, df1)
created_at Month Day Time
1 Mon Apr 06 22:19:45 PDT 2009 Mon Apr 22:19:45
2 Mon Apr 06 22:19:49 PDT 2009 Mon Apr 22:19:49
Data:
df <-
data.frame(
created_at = c("Mon Apr 06 22:19:45 PDT 2009","Mon Apr 06 22:19:49 PDT 2009")
)
CodePudding user response:
I think this might help you
Libraries
library(tidyverse)
library(lubridate)
Data
df <-
tibble(
created_at = c("Mon Apr 06 22:19:45 PDT 2009","Mon Apr 06 22:19:49 PDT 2009")
)
Code
df %>%
separate(
col = created_at,
into = c("wday","month","day","time","type","year"),
sep = " ",
remove = FALSE
) %>%
mutate(
day = as.numeric(day),
year = as.numeric(year),
month_num = which(month.abb == month),
time = hms(time),
date = lubridate::make_date(year = year,month = month_num,day = day)
)
Results
# A tibble: 2 x 9
created_at wday month day time type year month_num date
<chr> <chr> <chr> <dbl> <Period> <chr> <dbl> <int> <date>
1 Mon Apr 06 22:19:45 PDT 2009 Mon Apr 6 22H 19M 45S PDT 2009 4 2009-04-06
2 Mon Apr 06 22:19:49 PDT 2009 Mon Apr 6 22H 19M 49S PDT 2009 4 2009-04-06