I am trying to extract dates from text and create a new column in a dataset. Dates are entered in different formats in column A1 (either mm/dd/yy or mm/dd). I need to find a way to identify the date in column A1 and then add the year if it is missing. Thus far, I have been able to extract the date regardless of the format; however, when I use as.Date on the new column A2, the dates in mm/dd format become <NA>.
I am aware that there might not be a direct solution for this situation, but a workaround (generalizable to a larger data set) would be great. The dates run from September 2019 to August 2020. Additionally, I am not sure why the format I pass to as.Date does not control how the date gets displayed. This latter issue is not that important, but I am surprised by the behavior of the R function. A solution in tidyverse would be much appreciated.
library(tidyverse)
library(stringr)
db <- data.frame(A1 = c("review 11/18", "begins 12/4/19", "3/5/20", NA, "deadline 09/5/19", "9/3"))
db %>% mutate(A2 = str_extract(A1, "[0-9/0-9]+"))
# A1 A2
#1 review 11/18 11/18
#2 begins 12/4/19 12/4/19
#3 3/5/20 3/5/20
#4 <NA> <NA>
#5 deadline 09/5/19 09/5/19
#6 9/3 9/3
db %>% mutate(A2 = str_extract(A1, "[0-9/0-9]+")) %>%
  mutate(A2 = A2 %>% as.Date(., "%m/%d/%y"))
# A1 A2
# 1 review 11/18 <NA>
# 2 begins 12/4/19 2019-12-04
# 3 3/5/20 2020-03-05
# 4 <NA> <NA>
# 5 deadline 09/5/19 2019-09-05
# 6 9/3 <NA>
CodePudding user response:
Well, this is neither a beautiful, a concise, nor a tidyverse solution, but it does work and its modularity should make it flexible.
library(tidyverse)
db <- data.frame(A1 = c("review 11/18", "begins 12/4/19", "3/5/20", NA, "deadline 09/5/19", "9/3"))
db <- db %>% mutate(A2 = str_extract(A1, "[0-9/0-9]+"))
# Split each extracted date on "/" to count its parts and read the month
parts <- str_split(db$A2, "/", n = 3)
test1 <- lengths(parts)                           # 2 = m/d, 3 = m/d/yy
test2 <- as.numeric(sapply(parts, `[`, 1))        # month of each date (NA if missing)
# Only dates without a year get a suffix; computed once, before A2 is modified
needs_year <- !is.na(db$A2) & test1 == 2
# September to December belong to 2019, January to August to 2020
db$A2 <- ifelse(test = needs_year & test2 >= 9, yes = paste0(db$A2, "/19"), no = db$A2)
db$A2 <- ifelse(test = needs_year & test2 < 9, yes = paste0(db$A2, "/20"), no = db$A2)
db <- db %>% mutate(A2 = A2 %>% as.Date(., "%m/%d/%y"))
db
A1 A2
1 review 11/18 2019-11-18
2 begins 12/4/19 2019-12-04
3 3/5/20 2020-03-05
4 <NA> <NA>
5 deadline 09/5/19 2019-09-05
6 9/3 2019-09-03
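For a larger data set, the same year-filling rule could be wrapped in a small vectorized helper. This is only a sketch in base R: add_missing_year() and its arguments (first_month, first_year, second_year) are names made up for illustration, not part of any package, and it assumes the input has already been reduced to "m/d" or "m/d/yy" strings.

```r
# Hypothetical helper: append a two-digit year to "m/d" strings,
# leaving "m/d/yy" strings and NA values untouched.
add_missing_year <- function(x, first_month = 9,
                             first_year = "19", second_year = "20") {
  # Count the slashes: one slash means the year is missing
  n_slashes <- nchar(x) - nchar(gsub("/", "", x, fixed = TRUE))
  # The month is everything before the first slash
  month <- as.numeric(sub("/.*$", "", x))
  needs_year <- !is.na(x) & n_slashes == 1
  # Months from first_month onward fall in the first year, the rest in the second
  suffix <- ifelse(month >= first_month, first_year, second_year)
  ifelse(needs_year, paste0(x, "/", suffix), x)
}

extracted <- c("11/18", "12/4/19", "3/5/20", NA, "09/5/19", "9/3")
filled <- add_missing_year(extracted)
dates <- as.Date(filled, "%m/%d/%y")
```

Because the helper only touches strings with exactly one slash, it can be dropped into a mutate() call after the str_extract() step without affecting rows that already carry a year.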
CodePudding user response:
Perhaps:
library(tidyverse)
db <- data.frame(A1 = c("review 11/18", "begins 12/4/19", "3/5/20", NA, "deadline 09/5/19", "9/3"))
# Year runs from September 2019 to August 2020
(db <-
  db %>%
  mutate(A2 = str_extract(A1, '[\\d\\d/]+'),
         A2 = if_else(str_count(A2, '/') == 1 & as.numeric(str_extract(A2, '\\d+')) > 8, paste0(A2, '/19'), A2),
         A2 = if_else(str_count(A2, '/') == 1 & as.numeric(str_extract(A2, '\\d+')) <= 8, paste0(A2, '/20'), A2),
         A2 = as.Date(A2, "%m/%d/%y")))
#> A1 A2
#> 1 review 11/18 2019-11-18
#> 2 begins 12/4/19 2019-12-04
#> 3 3/5/20 2020-03-05
#> 4 <NA> <NA>
#> 5 deadline 09/5/19 2019-09-05
#> 6 9/3 2019-09-03
Created on 2021-11-21 by the reprex package (v2.0.1)
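On the side question about display: as far as I know, the format argument of as.Date only describes how to parse the input string, and a Date object always prints as yyyy-mm-dd. A minimal sketch of the difference, using format() to get a differently formatted character string back (the variable names d and shown are just for illustration):

```r
# The format argument tells as.Date() how to read the string;
# it does not change how the resulting Date is printed.
d <- as.Date("12/4/19", format = "%m/%d/%y")
d
# [1] "2019-12-04"

# To display the date differently, convert it back to character with format().
shown <- format(d, "%m/%d/%y")
shown
# [1] "12/04/19"
```

Note that shown is a character vector, not a Date, so it is best created only at the point of display.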
CodePudding user response:
I like the rematch2 package for many regex scenarios.
The first pattern tries to match the full m/d/y values. The second pattern tries to match the partial m/d values (furthermore, it separates the month from the day, so it can determine whether the year should be 2019 or 2020).
Once those pieces are isolated, the rest is just a sequence of small steps.
db |>
  rematch2::bind_re_match(from = A1, "^.*?(?<mdy>\\d{1,2}/\\d{1,2}/\\d{2})$") |>
  rematch2::bind_re_match(from = A1, "^.*?(?<md_m>\\d{1,2})/(?<md_d>\\d{1,2})$") |>
  dplyr::mutate(
    md_m = as.integer(md_m),
    md_y = dplyr::if_else(9L <= md_m, "19", "20"), # It's 2019 if the month is Sept or later
    md   = sprintf("%i/%s/%s", md_m, md_d, md_y),  # Assemble components
    md   = as.Date(md , "%m/%d/%y"),               # Convert data type
    mdy  = as.Date(mdy, "%m/%d/%y"),               # Convert data type
    date = dplyr::coalesce(mdy, md),               # Prefer the mdy if it's not missing
  )
Output:
A1 mdy md_m md_d md_y md date
1 review 11/18 <NA> 11 18 19 2019-11-18 2019-11-18
2 begins 12/4/19 2019-12-04 4 19 20 2020-04-19 2019-12-04
3 3/5/20 2020-03-05 5 20 20 2020-05-20 2020-03-05
4 <NA> <NA> NA <NA> <NA> <NA> <NA>
5 deadline 09/5/19 2019-09-05 5 19 20 2020-05-19 2019-09-05
6 9/3 <NA> 9 3 19 2019-09-03 2019-09-03
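Rows 2, 3, and 5 show why the final coalesce() prefers mdy: the m/d pattern also matches the tail of a full m/d/yy value, so md on those rows is not meaningful. A small sketch of that behavior, using base R's regexec() with unnamed groups instead of rematch2 so that it runs without extra packages (pattern_md and m are illustrative names):

```r
# Same m/d pattern as above, but with plain (unnamed) capture groups
pattern_md <- "^.*?(\\d{1,2})/(\\d{1,2})$"

# On a full m/d/yy value, the lazy prefix keeps growing until the remaining
# text fits "digits/digits" at the end of the string, which is "4/19".
m <- regmatches("begins 12/4/19",
                regexec(pattern_md, "begins 12/4/19", perl = TRUE))[[1]]
m[2:3]
# [1] "4"  "19"
```

So the day and year are captured as if they were month and day, which is harmless only because the full-date match takes priority.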