I am using stringr
to extract the date and the train identifier for each row in a dataframe from the variable status
. The string should have the following format:
2008-07-01 : Train_528 :cancelled
2005-11-01 : Train_528 :postponed
2005-06-28 : Train_528 :ontime
I use the following code to extract train id and date:
train_df %>%
separate_rows(status, sep = "\\n") %>%
mutate(date = as.Date(str_extract(status, "\\d.*\\d")),
train_id = str_extract(status, "(?<=:)\\w.*(?= :ontime)"))
The code works successfully. However, in some cases, the data is formatted incorrectly where the train ID should have been between the date and the status.
2008-07-01 : :cancelled
2005-11-01 : :postponed
2005-06-28 : :ontime
:Train_528 :cancelled
:Train_528 :postponed
:Train_528 :ontime
The main way to identify this issue is to match two colons with no characters in between:": :"
. What pattern can I use to pull the train ID without matching the train status string.
I tried using the following code but failed:
train_df %>%
separate_rows(status, sep = "\\n") %>%
mutate(status_duplicated = status) %>%
mutate(date = as.Date(str_extract(status, "\\d.*\\d")),
train_id = if_else(str_detect(status, ":\\s:"),
str_extract(status_duplicated, "(?<=:)\\w.*(?= :ontime)"),
str_extract(status, "(?<=:)\\w.*(?= :)")))
Reprex:
train_df <- structure(list(county = 1:3, status = c("2008-07-01 : :cancelled\n2005-11-01 : :postponed\n2005-06-28 : :ontime\n :Train_528 :cancelled\n :Train_528 :postponed\n :Train_528 :ontime",
"2017-01-13 :Train_222 :ontime\n2016-09-30 :Train_222 :postponed\n2016-09-14 :Train_222 :cancelled\n2014-08-07 :TR 1323 :cancelled\n :TR 1323 :postponed",
"1985-05-18 :Train_12 :ontime\n1981-12-15 :Train_12 :postponed"
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-3L))
CodePudding user response:
Your expected outcome is implicit. Here is a possible solution:
library(tidyverse)
train_df %>%
separate_rows(status, sep = "\n") %>%
separate(status, c("date", "train_id", "status"), sep = "\\s*:\\s*") %>%
mutate(across(-county, na_if, "")) %>%
fill(date, train_id, .direction = "downup")
# A tibble: 13 × 4
county date train_id status
<int> <chr> <chr> <chr>
1 1 2008-07-01 Train_528 cancelled
2 1 2005-11-01 Train_528 postponed
3 1 2005-06-28 Train_528 ontime
4 1 2005-06-28 Train_528 cancelled
5 1 2005-06-28 Train_528 postponed
6 1 2005-06-28 Train_528 ontime
7 2 2017-01-13 Train_222 ontime
8 2 2016-09-30 Train_222 postponed
9 2 2016-09-14 Train_222 cancelled
10 2 2014-08-07 TR 1323 cancelled
11 2 2014-08-07 TR 1323 postponed
12 3 1985-05-18 Train_12 ontime
13 3 1981-12-15 Train_12 postponed
CodePudding user response:
train_df %>%
mutate(status_duplicated = status) %>%
separate_rows(status, sep = "\\n") %>%
mutate(date = if_else(str_detect(status_duplicated, "\\d.*\\d : :(ontime|cancelled|postponed)"),
as.Date(str_extract(status_duplicated, "\\d.*\\d(?= : :ontime)")),
as.Date(str_extract(status, "\\d.*\\d"))),
train_id = if_else(str_detect(status_duplicated, "\\d.*\\d : :(ontime|cancelled|postponed)"),
str_extract(status_duplicated, "(?:(?<= :)\\w.*(?= :ontime))"),
str_extract(status, "(?<=:)\\w.*(?= :)"))) %>%
unique()
# A tibble: 13 × 4
county status date train_id
<int> <chr> <date> <chr>
1 1 "2008-07-01 : :cancelled" 2005-06-28 Train_528
2 1 "2005-11-01 : :postponed" 2005-06-28 Train_528
3 1 "2005-06-28 : :ontime" 2005-06-28 Train_528
4 1 " :Train_528 :cancelled" 2005-06-28 Train_528
5 1 " :Train_528 :postponed" 2005-06-28 Train_528
6 1 " :Train_528 :ontime" 2005-06-28 Train_528
7 2 "2017-01-13 :Train_222 :ontime" 2017-01-13 Train_222
8 2 "2016-09-30 :Train_222 :postponed" 2016-09-30 Train_222
9 2 "2016-09-14 :Train_222 :cancelled" 2016-09-14 Train_222
10 2 "2014-08-07 :TR 1323 :cancelled" 2014-08-07 TR 1323
11 2 " :TR 1323 :postponed" NA TR 1323
12 3 "1985-05-18 :Train_12 :ontime" 1985-05-18 Train_12
13 3 "1981-12-15 :Train_12 :postponed" 1981-12-15 Train_12
The issue with this solution is that I have to duplicate the output and remove duplicates using the unique function.