Home > Back-end >  How to convert charater date column with two different date formats to date data type
How to convert charater date column with two different date formats to date data type

Time:03-26

I'm new to R programming and currently working on a data frame, which is imported from a .csv file that has a "Date" column. This columns consists of two different date format (dd/mm/yyyy & mm/dd/yyyy) such as this: Image One

Then I went ahead and attempted to convert them all, which as of right now is a character type, to date data type with the following code:

test$Date <- as.Date(test$Date, format = "%d/%m/%Y")

The above code certainly works, and now the "Date" column is now converted to date type but some of the dates appear to be missing due to some kind of problem after running the code: Image Two

I have no idea what's the issue here and would like to ask if anyone knows how to fix it. I would appreciate all the help. Thank you!

CodePudding user response:

This is a real life mess where you have messy data - merged two date formats in 1 column.

  1. Please see if you can make both to same format : either dmy or ymd 2 OR if you can identify thsoe with some other data column? (For example some other variable which can help you identify which records have dmy and which has mdy
  2. If 2nd is possible, mutate the date column conditionally or use case_when.

OR : in try in 2 passes (or as many formats) : Apply lubridate: dmy() first and then apply ymd() where results are NA in first pass.

Example: Here sourced from A is dd/mm/yyyy and B are mm/dd/yyyy :

a <- data.frame(dates = c('02/01/2020','02/04/2020','02/06/2020',
                     '12/31/2020','11/21/2020')
           ,Sourced = c('A','A','A','B','B'))

First pass :

b <- a%>%
  mutate(Clean_Date1 = dmy(dates))

       dates Sourced Clean_Date1
1 02/01/2020       A  2020-01-02
2 02/04/2020       A  2020-04-02
3 02/06/2020       A  2020-06-02
4 12/31/2020       B        <NA>
5 11/21/2020       B        <NA>

Now using this :

b%>%
  mutate(Clean_Date2 = if_else(is.na(Clean_Date1),mdy(dates),Clean_Date1))

       dates Sourced Clean_Date1 Clean_Date2
1 02/01/2020       A  2020-01-02  2020-01-02
2 02/04/2020       A  2020-04-02  2020-04-02
3 02/06/2020       A  2020-06-02  2020-06-02
4 12/31/2020       B        <NA>  2020-12-31
5 11/21/2020       B        <NA>  2020-11-21

Hope this gives some ideas as final column is all in same date format:

'data.frame':   5 obs. of  4 variables:
 $ dates      : chr  "02/01/2020" "02/04/2020" "02/06/2020" "12/31/2020" ...
 $ Sourced    : chr  "A" "A" "A" "B" ...
 $ Clean_Date1: Date, format: "2020-01-02" "2020-04-02" "2020-06-02" ...
 $ Clean_Date2: Date, format: "2020-01-02" "2020-04-02" "2020-06-02" ...

CodePudding user response:

I don't think that it is not possible the way you want to do it!

The closest function I have in mind is parse_date_time from lubridate package which makes use of guess_formats function:

library(lubridate)

string_date <- c("29/02/2020", "7/01/2020")
string_date


parse_date_time(string_date, c("dmY", "mdY"))

# gives:
# [1] "2020-02-29 UTC" "2020-01-07 UTC"

And I don't think that two different date formats in one column as class date are possible.

So try to clean up your data and make two different columns or flag your class character column!

  • Related