Extract newline text from string

I have a string like the one below:

string="
(2021-07-04 11:58:43 PM BST)  
---  
len wee zim (Tradition (US) ) says to yohan sen  
[[:Conversations will be recorded and may be monitored by the participants and
their employers:]] Hi yohan  
  
  

(2021-07-05 12:04:42 AM BST)  
---  
len wee zim (Tradition (US) ) says to yohan sen  
okay -5 / 0  .



(2021-07-04 11:47:14 PM BST)  
---  
Ke Cho Ki says to
Hano Cho  
hello 
       

(2021-07-05 12:09:41 AM BST)  
---  
len wee zim (Tradition (Asia)) says 
to yohan sen  
yes -5 / 0 TN -- /  2.5  
  
  
---  
  
* * *

Processed by wokl Archive for son malab | 2021-07-05 12:26:44 AM
BST  
---"

I want to extract the text after "says to" and before the next timestamp.

Expected output as:

text=['yohan sen [[:Conversations will be recorded and may be monitored by the participants and their employers:]] Hi yohan','yohan sen okay -5 / 0 ','Hano Cho hello','yohan sen yes -5 / 0 TN -- /  2.5']

What I have tried:

text=re.findall(r'\bsays to (.*(?:\n(?!\(\d|---).*?)*?)\s*\n(?:\(\d|---)', string)

CodePudding user response：

You may use this regex:

says\s+to\s+((?:.+\n)+)


RegEx Details:

says\s+to\s+: Matches says, then to, each followed by 1+ whitespace characters
((?:.+\n)+): Matches 1+ non-empty lines and captures them in group #1

Python Code:

import re
matches = re.findall(r'says\s+to\s+((?:.+\n)+)', string)
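The captured blocks still contain the line breaks, while the expected output in the question has each message on one line. A minimal sketch of that last step, assuming a shortened, made-up sample in the same shape as the question's string (the names sample, blocks and text are only for illustration):

```python
import re

# Shortened, hypothetical sample shaped like the string in the question.
sample = (
    "(2021-07-04 11:58:43 PM BST)\n"
    "---\n"
    "len wee zim (Tradition (US) ) says to yohan sen\n"
    "[[:Conversations will be recorded:]] Hi yohan\n"
    "\n"
    "(2021-07-04 11:47:14 PM BST)\n"
    "---\n"
    "Ke Cho Ki says to\n"
    "Hano Cho\n"
    "hello\n"
    "\n"
)

# Group 1 holds the run of non-empty lines that follows "says to".
blocks = re.findall(r'says\s+to\s+((?:.+\n)+)', sample)

# str.split() with no argument splits on any whitespace run, so joining
# with a single space flattens each block onto one line.
text = [" ".join(block.split()) for block in blocks]
print(text)
```

Note that a line containing only spaces still counts as non-empty for .+, so with the original string the capture may run a little further than in this trimmed sample.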

CodePudding user response：

(?<=says\sto)[\s\S]*?(?=\(\d{4}-\d{2}-\d{2}\s(\d\d:){2}\d{2}\s\w{2}\s\w{3}\))

You can use a look-behind and a look-ahead for this. To solve your problem, you need one look-behind, which is 'says to', and one look-ahead, which is the date pattern.

Syntax for look-behind is (?<=fixed_length_regex)
Syntax for look-ahead is (?=regex); unlike a look-behind, a look-ahead does not have to be fixed length in Python's re module

So essentially what you are looking for would look something like this:

   look-behind  |        pattern          |  look-ahead
________________|_________________________|__________________
                |                         |
(?<=(says\sto)) |  match_everything_here  | (?=date_pattern)

which is equivalent to the regex at the top of this answer.

You can play around with the solution in regex101 here: https://regex101.com/r/rPFDo9/1/
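In Python this could look roughly like the sketch below. It assumes a shortened, made-up sample, and it changes (\d\d:) to the non-capturing (?:\d\d:), because re.findall returns the contents of capturing groups instead of the whole match when a pattern contains any.

```python
import re

# Shortened, hypothetical sample shaped like the string in the question.
sample = (
    "(2021-07-04 11:58:43 PM BST)\n"
    "---\n"
    "len wee zim (Tradition (US) ) says to yohan sen\n"
    "Hi yohan\n"
    "\n"
    "(2021-07-05 12:04:42 AM BST)\n"
    "---\n"
    "len wee zim (Tradition (US) ) says to yohan sen\n"
    "okay -5 / 0\n"
    "\n"
    "(2021-07-05 12:09:41 AM BST)\n"
)

pattern = re.compile(
    r'(?<=says\sto)'   # look-behind: text must be preceded by "says to"
    r'[\s\S]*?'        # lazily match anything, including newlines
    r'(?=\(\d{4}-\d{2}-\d{2}\s(?:\d\d:){2}\d{2}\s\w{2}\s\w{3}\))'  # next timestamp
)

# Flatten each match onto one line by collapsing whitespace runs.
text = [" ".join(m.split()) for m in pattern.findall(sample)]
print(text)
```

Two things to keep in mind with this pattern: \s matches exactly one whitespace character, so "says" and "to" split across a line break with a trailing space will not match, and a message is only returned if a parenthesised timestamp follows it.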

CodePudding user response：

With your shown samples, please try the following Python code. Written and tested in Python 3.

import re
# Assign the question's text to the variable string first; it is too long to repeat here.
var1 = re.findall(r'says\s+[^(]*', string, re.M)

The above creates a list named var1 whose elements end with trailing newlines. To remove them, use Python's strip function:

var1 = list(map(lambda s: s.strip(), var1))

Now print all elements of the var1 list:

for element in var1:
    print (element)

Explanation: The regex is simple. re.findall is given the pattern says\s+[^(]*, which matches says followed by one or more whitespace characters and then everything up to (but not including) the next occurrence of (.
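Because that pattern starts at says, each element also begins with "says to". If only the text after it is wanted, one possible variation (not the code above, just a sketch on a made-up sample) is to match to explicitly and put [^(]* in a capturing group:

```python
import re

# Shortened, hypothetical sample shaped like the string in the question.
sample = (
    "len wee zim (Tradition (US) ) says to yohan sen\n"
    "okay -5 / 0\n"
    "\n"
    "(2021-07-05 12:09:41 AM BST)\n"
    "---\n"
    "Ke Cho Ki says\n"
    "to Hano Cho\n"
    "hello\n"
)

# With one capturing group, re.findall returns only the group's text,
# i.e. everything after "says to" up to the next "(" or the end of input.
var1 = re.findall(r'says\s+to\s+([^(]*)', sample)

# Collapse newlines and repeated spaces so each message is one line.
var1 = [" ".join(s.split()) for s in var1]

for element in var1:
    print(element)
```

As with the original pattern, a "(" inside the message text itself would cut the match short.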