I am trying to find time patterns that contain any form of am or pm preceded by a number, and I want to replace the whole pattern with --.
My idea was to find strings containing am or pm, which may or may not have a dot (.) before, between, or after the letters, and then match them together with any number pattern in front of them, back to the preceding white space.
Here is the original data, t0:
t0 <- c("29th October 2022 5-6pm", "12-1pm 02/11/22", "10:25 bike rack at bexley college erith", "November 2nd 2022, apm shop ", " between 7pm Thursday 27th October to Saturday 29th October 9am", "04/09/2022 at 4 a.m.", "4/09/2022 at 4.a.m.", "04/09/2022 at 4.a.m" , "28.10.22 between 1.30pm and midnight", " Sunday 30th October 2022 between 11am and 3pm", "30th October, approx 6pm", "03/11/2022", "02/11/22 at campus", "Between 15:15 and 21:10", "03/11/2022 7pm", " Between 5:30pm and 6:30pm on 31/10/2022", "10am-2pm 31 oct 2022", "31/10/22 5.15am", " Tuesday 25th October 2022. 10:30pm", "30/10/2022 6pm")
I then created two variables, t1 and t2, to store the search result and the gsub result. This is what I get:
library("stringr")
t1 <- t0[str_detect(t0, "\\s[\\s|0-9|\\.|:]+a\\.?m\\.?|p\\.?m\\.?")]
t2 <- t1 %>% gsub("\\s[\\s|0-9|\\.|:]+a\\.?m\\.?|p\\.?m\\.?","--", .)
> t1
[1] "29th October 2022 5-6pm" "12-1pm 02/11/22"
[3] "November 2nd 2022, apm shop " " between 7pm Thursday 27th October to Saturday 29th October 9am"
[5] "04/09/2022 at 4 a.m." "4/09/2022 at 4.a.m."
[7] "04/09/2022 at 4.a.m" "28.10.22 between 1.30pm and midnight"
[9] " Sunday 30th October 2022 between 11am and 3pm" "30th October, approx 6pm"
[11] "03/11/2022 7pm" " Between 5:30pm and 6:30pm on 31/10/2022"
[13] "10am-2pm 31 oct 2022" "31/10/22 5.15am"
[15] " Tuesday 25th October 2022. 10:30pm" "30/10/2022 6pm"
> t2
[1] "29th October 2022 5-6--" "12-1-- 02/11/22"
[3] "November 2nd 2022, a-- shop " " between 7-- Thursday 27th October to Saturday 29th October--"
[5] "04/09/2022 at 4 a.m." "4/09/2022 at--"
[7] "04/09/2022 at--" "28.10.22 between 1.30-- and midnight"
[9] " Sunday 30th October 2022 between-- and 3--" "30th October, approx 6--"
[11] "03/11/2022 7--" " Between 5:30-- and 6:30-- on 31/10/2022"
[13] "10am-2-- 31 oct 2022" "31/10/22--"
[15] " Tuesday 25th October 2022. 10:30--" "30/10/2022 6--"
While the desired result is:
> t2
[1] "29th October 2022--" "-- 02/11/22"
[3] " between-- Thursday 27th October to Saturday 29th October--" "04/09/2022 at--"
[5] "4/09/2022 at--" "04/09/2022 at--"
[7] "28.10.22 between-- and midnight" " Sunday 30th October 2022 between-- and--"
[9] "30th October, approx--" "03/11/2022--"
[11] " Between-- and-- on 31/10/2022" "----- 31 oct 2022"
[13] "31/10/22--" " Tuesday 25th October 2022.--"
[15] "30/10/2022--"
How should I correct the regex pattern?
Answer:
t1 <- gsub("\\s?[-:0-9.]+\\s*[ap]\\.?m\\.?", "--", t0)
t1[t1 != t0]
# [1] "29th October 2022--"
# [2] "-- 02/11/22"
# [3] " between-- Thursday 27th October to Saturday 29th October--"
# [4] "04/09/2022 at--"
# [5] "4/09/2022 at--"
# [6] "04/09/2022 at--"
# [7] "28.10.22 between-- and midnight"
# [8] " Sunday 30th October 2022 between-- and--"
# [9] "30th October, approx--"
# [10] "03/11/2022--"
# [11] " Between-- and-- on 31/10/2022"
# [12] "---- 31 oct 2022"
# [13] "31/10/22--"
# [14] " Tuesday 25th October 2022.--"
# [15] "30/10/2022--"
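As a rough illustration of why the question's pattern misfires, the sketch below compares a simplified form of it (my own reduction, with the redundant characters dropped from the bracket) against the pattern used in this answer. The variable names are only for this example. Two things stand out: alternation has the lowest precedence, so everything after the top-level | stands alone as p\.?m\.? and matches a bare "pm"; and inside [...] a | is a literal character, not an "or".

```r
# Simplified form of the question's pattern (an assumption for illustration).
# Because "|" binds loosest, this reads as
#   (\s[0-9.:]+a\.?m\.?)  OR  (p\.?m\.?)
# so "pm" matches even when no number precedes it.
old_pattern <- "\\s[0-9.:]+a\\.?m\\.?|p\\.?m\\.?"

# Pattern from this answer: [ap] keeps the choice local to one character,
# and [-:0-9.]+ requires at least one digit/separator before am or pm.
new_pattern <- "\\s?[-:0-9.]+\\s*[ap]\\.?m\\.?"

x <- "November 2nd 2022, apm shop "
old_result <- gsub(old_pattern, "--", x)
new_result <- gsub(new_pattern, "--", x)
print(old_result)  # "November 2nd 2022, a-- shop "
print(new_result)  # "November 2nd 2022, apm shop " (unchanged)

# Inside a bracket expression "|" is just another character to match.
pipe_is_literal <- grepl("[a|b]", "|")
print(pipe_is_literal)  # TRUE
```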
The only difference between the output of t1[t1 != t0] and your stated "desired result" is element [12], where two adjacent replacements give four hyphens rather than five:
t1[t1 != t0][12]
# [1] "---- 31 oct 2022"
t2[12]
# [1] "----- 31 oct 2022"
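If the five hyphens in the desired result are intentional, one reading (a guess on my part) is that the - separating two complete times in "10am-2pm" should survive between the two replacements. A minimal sketch under that assumption: drop the hyphen from the character class and allow an optional "-number" group instead, so ranges like "5-6pm" are still consumed as a whole. The name range_pattern is hypothetical and not part of the answer above.

```r
# Hypothetical variant: "-" is no longer in the character class. The optional
# group (-[0-9:.]+)? still covers ranges such as "5-6pm" or "12-1pm", while
# the "-" between two full times ("10am-2pm") is left in place.
range_pattern <- "\\s?[0-9:.]+(-[0-9:.]+)?\\s*[ap]\\.?m\\.?"

x <- c("10am-2pm 31 oct 2022", "29th October 2022 5-6pm", "12-1pm 02/11/22")
res <- gsub(range_pattern, "--", x)
print(res)
# [1] "----- 31 oct 2022"   "29th October 2022--" "-- 02/11/22"
```

Whether this matches the intent depends on how ranges should be treated in the rest of the data, so it is worth checking against the full t0 vector before adopting it.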