regular expression: how to find and replace a sub-string together with any characters preceding it unti

Time:11-07

I am trying to find time patterns that contain any form of am or pm preceded by a number, and to replace each whole pattern with --.

My idea was to find am or pm, which may or may not have a dot . before, between, or after the letters, and then extend the match backwards over any number pattern in front of it until I reach a white space.

Here is the original data t0:

t0 <- c("29th October 2022 5-6pm", "12-1pm 02/11/22", "10:25 bike rack at bexley college erith", "November 2nd 2022, apm shop ", " between 7pm Thursday 27th October to Saturday 29th October 9am", "04/09/2022 at 4 a.m.", "4/09/2022 at 4.a.m.", "04/09/2022 at 4.a.m" , "28.10.22 between 1.30pm and midnight", " Sunday 30th October 2022 between 11am and 3pm", "30th October, approx 6pm", "03/11/2022", "02/11/22 at campus", "Between 15:15 and 21:10", "03/11/2022 7pm", " Between 5:30pm and 6:30pm on 31/10/2022", "10am-2pm 31 oct 2022", "31/10/22 5.15am", " Tuesday 25th October 2022. 10:30pm", "30/10/2022 6pm")

I then create two variables, t1 and t2, to store the search result and the gsub result. This is what I get:

library("stringr")

t1 <- t0[str_detect(t0, "\\s[\\s|0-9|\\.|:]+a\\.?m\\.?|p\\.?m\\.?")]
t2 <- t1 %>% gsub("\\s[\\s|0-9|\\.|:]+a\\.?m\\.?|p\\.?m\\.?","--", .)

> t1
 [1] "29th October 2022 5-6pm"                                         "12-1pm 02/11/22"                                                
 [3] "November 2nd 2022, apm shop "                                    " between 7pm Thursday 27th October to Saturday 29th October 9am"
 [5] "04/09/2022 at 4 a.m."                                            "4/09/2022 at 4.a.m."                                            
 [7] "04/09/2022 at 4.a.m"                                             "28.10.22 between 1.30pm and midnight"                           
 [9] " Sunday 30th October 2022 between 11am and 3pm"                  "30th October, approx 6pm"                                       
[11] "03/11/2022 7pm"                                                  " Between 5:30pm and 6:30pm on 31/10/2022"                       
[13] "10am-2pm 31 oct 2022"                                            "31/10/22 5.15am"                                                
[15] " Tuesday 25th October 2022. 10:30pm"                             "30/10/2022 6pm"   

> t2
 [1] "29th October 2022 5-6--"                                       "12-1-- 02/11/22"                                              
 [3] "November 2nd 2022, a-- shop "                                  " between 7-- Thursday 27th October to Saturday 29th October--"
 [5] "04/09/2022 at 4 a.m."                                          "4/09/2022 at--"                                               
 [7] "04/09/2022 at--"                                               "28.10.22 between 1.30-- and midnight"                         
 [9] " Sunday 30th October 2022 between-- and 3--"                   "30th October, approx 6--"                                     
[11] "03/11/2022 7--"                                                " Between 5:30-- and 6:30-- on 31/10/2022"                     
[13] "10am-2-- 31 oct 2022"                                          "31/10/22--"                                                   
[15] " Tuesday 25th October 2022. 10:30--"                           "30/10/2022 6--"   

While the desired result is:

> t2
[1] "29th October 2022--"                                              "-- 02/11/22"                                              
[3] " between-- Thursday 27th October to Saturday 29th October--"      "04/09/2022 at--"
[5] "4/09/2022 at--"                                                   "04/09/2022 at--"                                               
[7] "28.10.22 between-- and midnight"                                  " Sunday 30th October 2022 between-- and--"                   
[9] "30th October, approx--"                                           "03/11/2022--"                                                
[11] " Between-- and-- on 31/10/2022"                                  "----- 31 oct 2022"                                          
[13] "31/10/22--"                                                      " Tuesday 25th October 2022.--"                           
[15] "30/10/2022--"   

How should I correct the regex pattern?

CodePudding user response:

t1 <- gsub("\\s?[-:0-9.]+\\s*[ap]\\.?m\\.?", "--", t0)
t1[t1 != t0]
#  [1] "29th October 2022--"                                        
#  [2] "-- 02/11/22"                                                
#  [3] " between-- Thursday 27th October to Saturday 29th October--"
#  [4] "04/09/2022 at--"                                            
#  [5] "4/09/2022 at--"                                             
#  [6] "04/09/2022 at--"                                            
#  [7] "28.10.22 between-- and midnight"                            
#  [8] " Sunday 30th October 2022 between-- and--"                  
#  [9] "30th October, approx--"                                     
# [10] "03/11/2022--"                                               
# [11] " Between-- and-- on 31/10/2022"                             
# [12] "---- 31 oct 2022"                                           
# [13] "31/10/22--"                                                 
# [14] " Tuesday 25th October 2022.--"                              
# [15] "30/10/2022--"                                               
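
As a rough reading aid, the pattern can be assembled from four pieces: optional leading whitespace, one or more digit-like characters, optional whitespace, and then the am/pm part. The sketch below is only illustrative; the variable names (`lead_space`, `digits`, `gap`, `meridiem`) and the three sample strings are chosen here for the example, with the strings taken from t0.

```r
# Illustrative names only; the assembled pattern is
# "\\s?[-:0-9.]+\\s*[ap]\\.?m\\.?"
lead_space <- "\\s?"           # optional whitespace before the time
digits     <- "[-:0-9.]+"      # one or more digits, dots, colons or hyphens
gap        <- "\\s*"           # optional whitespace between number and am/pm
meridiem   <- "[ap]\\.?m\\.?"  # am, pm, a.m., p.m., a.m and similar
pattern    <- paste0(lead_space, digits, gap, meridiem)

x <- c("03/11/2022 7pm", "04/09/2022 at 4 a.m.", "November 2nd 2022, apm shop ")
out <- gsub(pattern, "--", x)
out[1]  # "03/11/2022--"
out[2]  # "04/09/2022 at--"
out[3]  # unchanged: "apm" has no digit in front of it
```

Because `digits` requires at least one character, a bare "apm" is left alone, while the hyphen placed first inside the brackets lets ranges such as 5-6pm be consumed as one match.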

The only difference between that and your professed "desired result" is in [12], where `10am` and `-2pm` are each replaced by `--`, giving four dashes rather than five:

t1[t1 != t0][12]
# [1] "---- 31 oct 2022"
t2[12]
# [1] "----- 31 oct 2022"
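
As for why the pattern in the question behaves the way it does, two things seem to contribute. A top-level `|` splits the whole expression, so the `p\\.?m\\.?` branch matches a bare "pm" anywhere, with no number required in front of it. Also, `|` inside square brackets is a literal character, not an "or". A minimal sketch, using a simplified pattern and a made-up input string rather than the original ones:

```r
x <- "apm shop 6pm"  # made-up input for illustration

# Top-level alternation: either "<space><digits>am" or just "pm".
loose <- gsub("\\s[0-9]+am|pm", "--", x)
loose    # "a-- shop 6--"

# Grouping the alternatives keeps the number requirement for both.
grouped <- gsub("\\s[0-9]+(am|pm)", "--", x)
grouped  # "apm shop--"

# Inside brackets, | is just another character in the set.
pipe_is_literal <- grepl("[a|b]", "|")
pipe_is_literal  # TRUE
```

This is consistent with the `"a-- shop "` entry in the t2 output of the question, where only the "pm" of "apm" was replaced.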