Match string between two colons in multiline string

Time:08-26

I am using stringr to extract the date and the train identifier from the variable status for each row of a data frame. Each line of the string should have the following format:

2008-07-01 : Train_528 :cancelled
2005-11-01 : Train_528 :postponed
2005-06-28 : Train_528 :ontime

I use the following code to extract train id and date:

train_df %>% 
  separate_rows(status, sep = "\\n") %>% 
  mutate(date = as.Date(str_extract(status, "\\d.*\\d")),
         train_id = str_extract(status, "(?<=:)\\w.*(?= :ontime)"))

The code works successfully. However, in some cases the data is malformed: the train ID, which should appear between the date and the status, is missing there and shows up on separate lines instead:

2008-07-01 : :cancelled
2005-11-01 : :postponed
2005-06-28 : :ontime
 :Train_528 :cancelled
 :Train_528 :postponed
 :Train_528 :ontime

The main way to identify this issue is to match two colons with nothing but whitespace between them: ": :". What pattern can I use to pull the train ID without matching the train status string?

I tried using the following code but failed:

train_df %>% 
  separate_rows(status, sep = "\\n") %>%
  mutate(status_duplicated = status) %>% 
  mutate(date = as.Date(str_extract(status, "\\d.*\\d")),
         train_id = if_else(str_detect(status, ":\\s:"), 
                            str_extract(status_duplicated, "(?<=:)\\w.*(?= :ontime)"),
                            str_extract(status, "(?<=:)\\w.*(?= :)")))

Reprex:

train_df <- structure(list(county = 1:3, status = c("2008-07-01 : :cancelled\n2005-11-01 : :postponed\n2005-06-28 : :ontime\n :Train_528 :cancelled\n :Train_528 :postponed\n :Train_528 :ontime", 
"2017-01-13 :Train_222 :ontime\n2016-09-30 :Train_222 :postponed\n2016-09-14 :Train_222 :cancelled\n2014-08-07 :TR 1323 :cancelled\n :TR 1323 :postponed", 
"1985-05-18 :Train_12 :ontime\n1981-12-15 :Train_12 :postponed"
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-3L))

CodePudding user response:

Your expected output isn't stated explicitly, but here is a possible solution:

library(tidyverse)

train_df %>%
  separate_rows(status, sep = "\n") %>%
  separate(status, c("date", "train_id", "status"), sep = "\\s*:\\s*") %>%
  mutate(across(-county, ~ na_if(.x, ""))) %>%  # lambda form; passing extra args through across() is deprecated in dplyr >= 1.1
  fill(date, train_id, .direction = "downup")

# A tibble: 13 × 4
   county date       train_id  status   
    <int> <chr>      <chr>     <chr>    
 1      1 2008-07-01 Train_528 cancelled
 2      1 2005-11-01 Train_528 postponed
 3      1 2005-06-28 Train_528 ontime   
 4      1 2005-06-28 Train_528 cancelled
 5      1 2005-06-28 Train_528 postponed
 6      1 2005-06-28 Train_528 ontime   
 7      2 2017-01-13 Train_222 ontime   
 8      2 2016-09-30 Train_222 postponed
 9      2 2016-09-14 Train_222 cancelled
10      2 2014-08-07 TR 1323   cancelled
11      2 2014-08-07 TR 1323   postponed
12      3 1985-05-18 Train_12  ontime   
13      3 1981-12-15 Train_12  postponed
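
One caveat, offered as a sketch rather than a correction: `fill()` above runs over the whole data frame, so if a county's first or last rows lacked a value, it would be borrowed from a neighbouring county. Assuming the gaps should only ever be filled from rows of the same county (the question does not say so explicitly), grouping by `county` first keeps the fill local. The `demo_df` below is a cut-down stand-in for `train_df`, and converting `date` with `as.Date()` at the end is an optional extra.

```r
library(dplyr)
library(tidyr)

# Cut-down stand-in for train_df (lines taken from the reprex)
demo_df <- tibble(
  county = 1:2,
  status = c(
    "2005-06-28 : :ontime\n :Train_528 :ontime",
    "2014-08-07 :TR 1323 :cancelled\n :TR 1323 :postponed"
  )
)

result <- demo_df %>%
  separate_rows(status, sep = "\n") %>%                       # one row per line
  separate(status, c("date", "train_id", "status"),
           sep = "\\s*:\\s*") %>%                             # split on colons, trimming spaces
  mutate(across(-county, ~ na_if(.x, ""))) %>%                # empty pieces become NA
  group_by(county) %>%                                        # assumption: fill within a county only
  fill(date, train_id, .direction = "downup") %>%
  ungroup() %>%
  mutate(date = as.Date(date))

result
```

Without `group_by()`, the result on the reprex is the same; the grouping only matters when a gap sits at the boundary between two counties.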

CodePudding user response:

train_df %>% 
  mutate(status_duplicated = status) %>% 
  separate_rows(status, sep = "\\n") %>% 
  mutate(date = if_else(str_detect(status_duplicated, "\\d.*\\d : :(ontime|cancelled|postponed)"),
                        as.Date(str_extract(status_duplicated, "\\d.*\\d(?= : :ontime)")),
                        as.Date(str_extract(status, "\\d.*\\d"))),
         train_id = if_else(str_detect(status_duplicated, "\\d.*\\d : :(ontime|cancelled|postponed)"),
                            str_extract(status_duplicated, "(?:(?<= :)\\w.*(?= :ontime))"),
                            str_extract(status, "(?<=:)\\w.*(?= :)"))) %>%
  unique()

# A tibble: 13 × 4
   county status                             date       train_id 
    <int> <chr>                              <date>     <chr>    
 1      1 "2008-07-01 : :cancelled"          2005-06-28 Train_528
 2      1 "2005-11-01 : :postponed"          2005-06-28 Train_528
 3      1 "2005-06-28 : :ontime"             2005-06-28 Train_528
 4      1 " :Train_528 :cancelled"           2005-06-28 Train_528
 5      1 " :Train_528 :postponed"           2005-06-28 Train_528
 6      1 " :Train_528 :ontime"              2005-06-28 Train_528
 7      2 "2017-01-13 :Train_222 :ontime"    2017-01-13 Train_222
 8      2 "2016-09-30 :Train_222 :postponed" 2016-09-30 Train_222
 9      2 "2016-09-14 :Train_222 :cancelled" 2016-09-14 Train_222
10      2 "2014-08-07 :TR 1323 :cancelled"   2014-08-07 TR 1323  
11      2 " :TR 1323 :postponed"             NA         TR 1323  
12      3 "1985-05-18 :Train_12 :ontime"     1985-05-18 Train_12 
13      3 "1981-12-15 :Train_12 :postponed"  1981-12-15 Train_12

The drawback of this solution is that I have to duplicate the status column and then remove duplicate rows with unique().
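
A possible way to avoid the duplicated column is to parse each line on its own with a single pattern that captures all three fields, so the train ID is never confused with the status. The sketch below assumes every line contains exactly two colons, that dates look like `YYYY-MM-DD`, and that the status is a single word; the variable names (`lines`, `pattern`, `parts`) are only illustrative.

```r
library(stringr)

# Sample lines covering the well-formed and the malformed layouts
lines <- c(
  "2008-07-01 : Train_528 :cancelled",
  "2005-06-28 : :ontime",
  " :Train_528 :postponed",
  "2014-08-07 :TR 1323 :cancelled"
)

# Group 1: optional date; group 2: whatever sits between the two colons
# (possibly empty, may contain spaces); group 3: the status word.
pattern <- "^\\s*(\\d{4}-\\d{2}-\\d{2})?\\s*:\\s*([^:]*?)\\s*:\\s*(\\w+)\\s*$"
parts <- str_match(lines, pattern)

date     <- as.Date(parts[, 2])                                   # NA where no date was present
train_id <- ifelse(parts[, 3] == "", NA_character_, parts[, 3])   # empty middle field becomes NA
status   <- parts[, 4]

data.frame(date, train_id, status)
```

The remaining `NA` values in `date` and `train_id` can then be filled from neighbouring rows, for example with `tidyr::fill()` as in the first answer, which makes the `unique()` step unnecessary.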
