Based on the data and code below, how can I change every incorrect date to 01/01/year from year column
?
Data:
df <- structure(list(Year = c("2022", "2022", "2022", "2022", "2022",
"2022", "2005", "2005", "2020", "2021"), `Reference Number` = c("14784",
"14784", "14785", "14785", "14786", "14786", "14787", "14787",
"14788", "14788"), `Request Date` = c("1/6/2022", "1/6/2022",
"11/19/2022", "Happy New Year", "1899-12-31 02:40:00", "Ongoing", "1//12/05",
"01/14/205", "1/25/20`22", NA)), row.names = c(NA, -10L
), class = c("tbl_df", "tbl", "data.frame"))
df
# A tibble: 10 × 3
Year `Reference Number` `Request Date`
<chr> <chr> <chr>
1 2022 14784 1/6/2022
2 2022 14784 1/6/2022
3 2022 14785 11/19/2022
4 2022 14785 Happy New Year
5 2022 14786 1899-12-31 02:40:00
6 2022 14786 Ongoing
7 2005 14787 1//12/05
8 2005 14787 01/14/205
9 2020 14788 1/25/20`22
10 2021 14788 NA
Desired output:
desired <- structure(list(Year = c("2022", "2022", "2022", "2022", "2022",
"2022", "2005", "2005", "2022", "2021"), `Reference Number` = c("14784",
"14784", "14785", "14785", "14786", "14786", "14787", "14787",
"14788", "14788"), `Request Date` = c("1/6/2022", "1/6/2022",
"11/19/2022", "01/01/2022", "01/01/2022", "01/01/2022", "01/01/2005",
"01/01/2005", "01/01/2022", "01/01/2021")), row.names = c(NA, -10L
), class = c("tbl_df", "tbl", "data.frame"))
desired
# A tibble: 10 × 3
Year `Reference Number` `Request Date`
<chr> <chr> <chr>
1 2022 14784 1/6/2022
2 2022 14784 1/6/2022
3 2022 14785 11/19/2022
4 2022 14785 01/01/2022
5 2022 14786 01/01/2022
6 2022 14786 01/01/2022
7 2005 14787 01/01/2005
8 2005 14787 01/01/2005
9 2022 14788 01/01/2022
10 2021 14788 01/01/2021
Code:
library(tidyverse)
df = df %>% mutate(`Reference Date` = .....`Reference Date`)
CodePudding user response:
To illustrate the difficulty (impossibility?) of addressing your question satisfactorily, consider the following
library(anytime)
df %>%
# Remove duplicate separators
mutate(`Request Date` = str_replace_all(`Request Date`, "//", "/")) %>%
# Substition logic
mutate(Date = if_else(
str_detect(`Request Date`, "^. /. /. $"),
anydate(`Request Date`),
as.Date(sprintf("%s-01-01", Year))))
## A tibble: 10 × 4
# Year `Reference Number` `Request Date` Date
# <chr> <chr> <chr> <date>
# 1 2022 14784 1/6/2022 2022-01-06
# 2 2022 14784 1/6/2022 2022-01-06
# 3 2022 14785 11/19/2022 2022-11-19
# 4 2022 14785 Happy New Year 2022-01-01
# 5 2022 14786 1899-12-31 02:40:00 2022-01-01
# 6 2022 14786 Ongoing 2022-01-01
# 7 2005 14787 1/12/05 NA
# 8 2005 14787 01/14/205 NA
# 9 2020 14788 1/25/20`22 NA
#10 2021 14788 NA NA
Here we're using anytime::anydate
which tries to guess a date from an arbitrary date-like string.
The problem are ambiguities: Even if we remove duplicate separators, what date does "1/12/05" correspond to? 1 Dec 2005? 12 January 2005? I don't think this can be answered in a way that it addresses all edge cases automatically. Instead you'll need some heuristic (probably realised through some case_when
or nested if_else
statements) to get the best-possible solution for your specific situation.