Extract text between two dates in R

Time:09-16

I have a long text column and want to extract the text between two dates in each string. The two dates are the last two dates in the string.

df <- data.frame(f1 = c(
  "Today is test 2021-09-15",
  "This is to be done today. 2020-04-05.Today is going tobe  2021-09-15",
  "Great Novel. 2018-08-09.This is to be done today. 2020-04-05.The lion is an animal 2021-09-15",
  "This is to be done today. 2020-04-05.Today is test 2021-09-01.Monday is the first day 2021-08-02"
))

Expected output:

Today is test
Today is going to be
The lion is an animal
Monday is the first day

I am able to extract the last two dates, but not the text between them. If a string contains only one date, the whole text before that date should be returned. Please guide.

CodePudding user response:

You can use:

sub('(?:.*\\d+-\\d+-\\d+\\.)?(.*?)\\s+\\d+-\\d+-\\d+$', '\\1', df$f1)

#[1] "Today is test"           "Today is going tobe"    
#[3] "The lion is an animal"   "Monday is the first day"

where:

(?:.*\\d+-\\d+-\\d+\\.)? is an optional non-capturing group that matches everything up to and including a date followed by a period. It is optional because the first value has no date before the text that we want to extract. Because .* is greedy, the group consumes as much as it can, so the date it ends on is the second-to-last date in the text.

(.*?) is a capturing group that lazily matches everything after that group up to the whitespace before the final date, which sits at the end of the text (\\s+\\d+-\\d+-\\d+$).
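The explanation above can be sketched as a small reusable helper. This is only an illustration: the function name extract_last_segment, the stricter yyyy-mm-dd date pattern, and the use of perl = TRUE are assumptions added here, not part of the original answer.

```r
# Hypothetical helper; the name and the strict date pattern are assumptions.
extract_last_segment <- function(x) {
  date <- "\\d{4}-\\d{2}-\\d{2}"  # assumes ISO-style yyyy-mm-dd dates
  # Optional greedy prefix ending in "<date>.", then the lazily captured
  # text, then whitespace and the final date at the end of the string.
  pattern <- paste0("(?:.*", date, "\\.)?(.*?)\\s+", date, "$")
  sub(pattern, "\\1", x, perl = TRUE)
}

x <- c(
  "Today is test 2021-09-15",
  "This is to be done today. 2020-04-05.Today is going tobe  2021-09-15",
  "Great Novel. 2018-08-09.This is to be done today. 2020-04-05.The lion is an animal 2021-09-15",
  "This is to be done today. 2020-04-05.Today is test 2021-09-01.Monday is the first day 2021-08-02"
)

res <- extract_last_segment(x)
print(res)
```

Because sub() returns its input unchanged when the pattern does not match, strings that do not end in a date would pass through as they are under this sketch.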

CodePudding user response:

You can try:

sub(".*?([^-0-9.]+)[-0-9]+$", "\\1", df$f1)
#[1] "Today is test "           "Today is going tobe  "   
#[3] "The lion is an animal "   "Monday is the first day "

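Here .*? lazily matches any leading text, [^-0-9.]+ matches one or more characters that are not a hyphen, a digit, or a period, [-0-9]+ matches one or more hyphens and digits, and $ is the end of the string. Note that the results keep the whitespace that precedes the final date.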
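Since this pattern keeps the whitespace before the final date, one possible follow-up is to trim the result with trimws(). The sketch below uses a two-element sample vector and perl = TRUE; both are assumptions made here so the lazy quantifier follows PCRE semantics, and neither is part of the original answer.

```r
# Sample input copied from the question; x and res2 are illustrative names.
x <- c(
  "Today is test 2021-09-15",
  "This is to be done today. 2020-04-05.Today is going tobe  2021-09-15"
)

# Same pattern as above, followed by trimws() to drop the trailing spaces.
res2 <- trimws(sub(".*?([^-0-9.]+)[-0-9]+$", "\\1", x, perl = TRUE))
print(res2)
```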

CodePudding user response:

You can use lookarounds:

library(stringr)
str_extract(df$f1, "(?<=\\.|^)\\D+(?=\\s[-\\d]+$)")
#[1] "Today is test"           "Today is going tobe "   
#[3] "The lion is an animal"   "Monday is the first day"

How this works:

  • (?<=\\.|^): positive lookbehind asserting that the target string (in your case, the text) must be preceded by either the start of the string (^) or a period (.)
  • \\D+: the target string, matched as one or more characters that are not digits
  • (?=\\s[-\\d]+$): positive lookahead asserting that the target string must be followed by a single whitespace character and then one or more hyphens and digits running to the end of the string
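The same lookaround idea can also be sketched in base R, without stringr. This variant is an assumption-laden illustration rather than the answer's own code: it swaps str_extract() for regexpr() and regmatches() with perl = TRUE, makes \\D+ lazy, and allows \\s+ in the lookahead so that extra spaces before the final date are not captured.

```r
x <- c(
  "Today is test 2021-09-15",
  "This is to be done today. 2020-04-05.Today is going tobe  2021-09-15",
  "Great Novel. 2018-08-09.This is to be done today. 2020-04-05.The lion is an animal 2021-09-15",
  "This is to be done today. 2020-04-05.Today is test 2021-09-01.Monday is the first day 2021-08-02"
)

# Lazy \\D+? stops as soon as the lookahead sees whitespace plus the final date.
pattern <- "(?<=\\.|^)\\D+?(?=\\s+[-\\d]+$)"

m <- regexpr(pattern, x, perl = TRUE)  # position of the first match per string
res3 <- regmatches(x, m)               # extract the matched text
print(res3)
```

One design caveat: regmatches() silently drops elements that have no match, so the result can be shorter than the input, whereas str_extract() returns NA for those elements.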