Extract differently formatted dates (as strings) with tidyr::extract

Time:04-18

I have this data.frame:

df <- structure(list(date = c("28. Dezember 2004", "29. Dezember 2004", 
"30. Dezember 2004", "5. Jan. 2005", "27. Jan. 2005", "16. Feb. 2005"
)), row.names = 617:622, class = "data.frame")

                 date
617 28. Dezember 2004
618 29. Dezember 2004
619 30. Dezember 2004
620      5. Jan. 2005
621     27. Jan. 2005
622     16. Feb. 2005

I want to extract the day, the month (which is formatted differently starting in 2005), and the year. So I do the following:

df %>%
    extract(
        "date", 
        into = c("day", "month", "year"),
        regex = ("(^\\d{1,2})\\. ([a-zA-Zä]*)\\.? (\\d{4})"),
        convert = FALSE,
        remove = FALSE
    )

Yet the result is:

                 date  day    month year
617 28. Dezember 2004   28 Dezember 2004
618 29. Dezember 2004   29 Dezember 2004
619 30. Dezember 2004   30 Dezember 2004
620      5. Jan. 2005 <NA>     <NA> <NA>
621     27. Jan. 2005 <NA>     <NA> <NA>
622     16. Feb. 2005 <NA>     <NA> <NA>

I am not sure exactly what is going wrong here.

CodePudding user response:

A possible solution, based on tidyr::separate:

library(tidyverse)

df %>% 
  separate(date, into = c("day", "month", "year"), sep = " ", convert = TRUE, remove = FALSE) 

#>                  date day    month year
#> 617 28. Dezember 2004  28 Dezember 2004
#> 618 29. Dezember 2004  29 Dezember 2004
#> 619 30. Dezember 2004  30 Dezember 2004
#> 620      5. Jan. 2005   5     Jan. 2005
#> 621     27. Jan. 2005  27     Jan. 2005
#> 622     16. Feb. 2005  16     Feb. 2005

Or using lubridate:

library(tidyverse)
library(lubridate)

df %>%
  mutate(date = str_replace(date, "Dezember", "December")) %>% 
  mutate(day = day(dmy(date)),
         month = month(dmy(date), label = TRUE, abbr = FALSE),
         year = year(dmy(date)))

#>                  date day    month year
#> 617 28. December 2004  28 December 2004
#> 618 29. December 2004  29 December 2004
#> 619 30. December 2004  30 December 2004
#> 620      5. Jan. 2005   5  January 2005
#> 621     27. Jan. 2005  27  January 2005
#> 622     16. Feb. 2005  16 February 2005

CodePudding user response:

To fix the regex part of extract() in your code, try:

df %>%
  extract(
    date, 
    into = c("day", "month", "year"),
    regex = "(\\d+)[. ]+(\\w+)[. ]+(\\d+)",
    remove = FALSE,
    convert = TRUE
  )

# # A tibble: 6 × 4
#   date                day month     year
#   <chr>             <int> <chr>    <int>
# 1 28. Dezember 2004    28 Dezember  2004
# 2 29. Dezember 2004    29 Dezember  2004
# 3 30. Dezember 2004    30 Dezember  2004
# 4 5. Jan. 2005          5 Jan       2005
# 5 27. Jan. 2005        27 Jan       2005
# 6 16. Feb. 2005        16 Feb       2005

This is equivalent to setting sep = "[. ]+" in separate().
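
For completeness, here is a minimal sketch of that separate() variant, assuming the same df as in the question. The name res is only illustrative. The separator treats any run of dots and spaces as a single delimiter, so the trailing dot on "Jan." and "Feb." is dropped along with the space.

```r
library(tidyr)

# Same data as in the question
df <- structure(list(date = c("28. Dezember 2004", "29. Dezember 2004",
"30. Dezember 2004", "5. Jan. 2005", "27. Jan. 2005", "16. Feb. 2005"
)), row.names = 617:622, class = "data.frame")

# Split on one or more dots/spaces; convert = TRUE turns day and year into integers
res <- separate(
  df,
  date,
  into = c("day", "month", "year"),
  sep = "[. ]+",
  remove = FALSE,
  convert = TRUE
)

res
```

Unlike sep = " ", this leaves month as "Jan" and "Feb" without the dot, matching the extract() result.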
