I have this data.frame:
df <- structure(list(date = c("28. Dezember 2004", "29. Dezember 2004",
  "30. Dezember 2004", "5. Jan. 2005", "27. Jan. 2005", "16. Feb. 2005"
  )), row.names = 617:622, class = "data.frame")
                 date
617 28. Dezember 2004
618 29. Dezember 2004
619 30. Dezember 2004
620      5. Jan. 2005
621     27. Jan. 2005
622     16. Feb. 2005
I want to extract the day, the month (which is formatted differently from 2005 onward) and the year, so I do the following:
df %>%
  extract(
    "date",
    into = c("day", "month", "year"),
    regex = "(^\\d{1,2})\\. ([a-zA-Zä]*)\\.? (\\d{4})",
    convert = FALSE,
    remove = FALSE
  )
Yet the result is:
                 date  day    month year
617 28. Dezember 2004   28 Dezember 2004
618 29. Dezember 2004   29 Dezember 2004
619 30. Dezember 2004   30 Dezember 2004
620      5. Jan. 2005 <NA>     <NA> <NA>
621     27. Jan. 2005 <NA>     <NA> <NA>
622     16. Feb. 2005 <NA>     <NA> <NA>
I am not sure exactly what is going wrong here.
CodePudding user response:
A possible solution, based on tidyr::separate:
library(tidyverse)
df %>%
  separate(date, into = c("day", "month", "year"), sep = " ",
           convert = TRUE, remove = FALSE)
#>                  date day    month year
#> 617 28. Dezember 2004  28 Dezember 2004
#> 618 29. Dezember 2004  29 Dezember 2004
#> 619 30. Dezember 2004  30 Dezember 2004
#> 620      5. Jan. 2005   5     Jan. 2005
#> 621     27. Jan. 2005  27     Jan. 2005
#> 622     16. Feb. 2005  16     Feb. 2005
Or using lubridate:
library(tidyverse)
library(lubridate)
df %>%
  mutate(date = str_replace(date, "Dezember", "December")) %>%
  mutate(day = day(dmy(date)),
         month = month(dmy(date), label = TRUE, abbr = FALSE),
         year = year(dmy(date)))
#>                  date day    month year
#> 617 28. December 2004  28 December 2004
#> 618 29. December 2004  29 December 2004
#> 619 30. December 2004  30 December 2004
#> 620      5. Jan. 2005   5  January 2005
#> 621     27. Jan. 2005  27  January 2005
#> 622     16. Feb. 2005  16 February 2005
CodePudding user response:
To fix the regex part of extract() in your code, try:
df %>%
  extract(
    date,
    into = c("day", "month", "year"),
    regex = "(\\d+)[. ]+(\\w+)[. ]+(\\d+)",
    remove = FALSE,
    convert = TRUE
  )
# # A tibble: 6 × 4
#   date                day month     year
#   <chr>             <int> <chr>    <int>
# 1 28. Dezember 2004    28 Dezember  2004
# 2 29. Dezember 2004    29 Dezember  2004
# 3 30. Dezember 2004    30 Dezember  2004
# 4 5. Jan. 2005          5 Jan       2005
# 5 27. Jan. 2005        27 Jan       2005
# 6 16. Feb. 2005        16 Feb       2005
This is equivalent to setting sep = "[. ]+" in separate().
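A minimal sketch of that separate() variant, assuming tidyr is installed. The data frame from the question is rebuilt here so the snippet runs on its own, and the name res is only for illustration:

```r
library(tidyr)

# Rebuild the example data from the question
df <- data.frame(
  date = c("28. Dezember 2004", "29. Dezember 2004", "30. Dezember 2004",
           "5. Jan. 2005", "27. Jan. 2005", "16. Feb. 2005"),
  row.names = 617:622,
  stringsAsFactors = FALSE
)

# Split on any run of dots and/or spaces,
# so "5. Jan. 2005" becomes "5", "Jan", "2005"
res <- separate(df, date, into = c("day", "month", "year"),
                sep = "[. ]+", remove = FALSE, convert = TRUE)
res
```

Because the dots are consumed as part of the separator, the abbreviated months come back as "Jan" and "Feb" rather than "Jan." and "Feb.".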