Home > Blockchain >  How to clean messy date formats in a dataframe using R
How to clean messy date formats in a dataframe using R

Time:11-12

What is a quick way to clean a column with multiple date formats and obtain only the year? Suppose in r there is a dataframe (df) as below, which has aDatecolumn of characters with different dates formats.

df <- data.frame(z= paste("Date",seq(1:10)), Date=c("2000-10-22", "9/21/2001", "2003", "2017/2018", "9/28/2010",
                       "9/27/2011","2019/2020", "2017-10/2018-12", "NA", "" ))
df:
     z            Date
1   Date 1      2000-10-22
2   Date 2       9/21/2001
3   Date 3            2003
4   Date 4       2017/2018
5   Date 5       9/28/2010
6   Date 6       9/27/2011
7   Date 7       2019/2020
8   Date 8 2017-10/2018-12
9   Date 9              NA
10 Date 10     

       

Using r commands what is a quick way to extract out the years e.g. 2003, 2010 from the Date column? The first year is to be selected for cells with two years in a row.

So that the expected output would be like below:

     z            Date     year
1   Date 1      2000-10-22 2000
2   Date 2       9/21/2001 2001
3   Date 3            2003 2003
4   Date 4       2007/2018 2017
5   Date 5       9/28/2010 2010
6   Date 6       9/27/2011 2011
7   Date 7       2007/2018 2019
8   Date 8 2017-10/2018-12 2017
9   Date 9              NA   NA
10 Date 10                 

CodePudding user response:

Use extract from tidyr. If there are two years it will use the first.

library(dplyr)
library(tidyr)

df %>% extract(Date, "Year", "(\\d{4})", remove = FALSE, convert = TRUE)

giving:

         z            Date Year
1   Date 1      2000-10-22 2000
2   Date 2       9/21/2001 2001
3   Date 3            2003 2003
4   Date 4       2017/2018 2017
5   Date 5       9/28/2010 2010
6   Date 6       9/27/2011 2011
7   Date 7       2019/2020 2019
8   Date 8 2017-10/2018-12 2017
9   Date 9              NA   NA
10 Date 10                   NA

If the second year is needed as well then:

df %>% 
  extract(Date, "Year2", "\\d{4}.*(\\d{4})", remove = FALSE, convert = TRUE) %>%
  extract(Date, "Year", "(\\d{4})", remove = FALSE, convert = TRUE)

giving:

         z            Date Year Year2
1   Date 1      2000-10-22 2000    NA
2   Date 2       9/21/2001 2001    NA
3   Date 3            2003 2003    NA
4   Date 4       2017/2018 2017  2018
5   Date 5       9/28/2010 2010    NA
6   Date 6       9/27/2011 2011    NA
7   Date 7       2019/2020 2019  2020
8   Date 8 2017-10/2018-12 2017  2018
9   Date 9              NA   NA    NA
10 Date 10                   NA    NA
  • Related