I have a string as shown below:
string="
(2021-07-04 11:58:43 PM BST)
---
len wee zim (Tradition (US) ) says to yohan sen
[[:Conversations will be recorded and may be monitored by the participants and
their employers:]] Hi yohan
(2021-07-05 12:04:42 AM BST)
---
len wee zim (Tradition (US) ) says to yohan sen
okay -5 / 0 .
(2021-07-04 11:47:14 PM BST)
---
Ke Cho Ki says to
Hano Cho
hello
(2021-07-05 12:09:41 AM BST)
---
len wee zim (Tradition (Asia)) says
to yohan sen
yes -5 / 0 TN -- / 2.5
---
* * *
Processed by wokl Archive for son malab | 2021-07-05 12:26:44 AM
BST
---"
I want to extract the text after "says to" and before the next timestamp.
Expected output:
text=['yohan sen [[:Conversations will be recorded and may be monitored by the participants and their employers:]] Hi yohan','yohan sen okay -5 / 0 ','Hano Cho hello','yohan sen yes -5 / 0 TN -- / 2.5']
What I have tried:
text=re.findall(r'\bsays to (.*(?:\n(?!\(\d|---).*?)*?)\s*\n(?:\(\d|---)', string)
CodePudding user response:
You may use this regex:
says\s+to\s+((?:.+\n)+)
RegEx Details:
- says\s+to\s+: Matches "says to" followed by 1+ whitespaces
- ((?:.+\n)+): Matches 1+ non-empty lines and captures them in group #1
Python Code:
matches = re.findall(r'says\s+to\s+((?:.+\n)+)', string)
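A minimal sketch of how this pattern could be run. The sample below is a shortened, hypothetical excerpt, and it assumes each message is followed by a blank line, since (?:.+\n)+ only stops at an empty line. The whitespace collapsing with split/join is an addition, used here to get the single-line format of the expected output.

```python
import re

# Hypothetical, shortened sample; assumes a blank line ends each message.
sample = (
    "(2021-07-04 11:58:43 PM BST)\n"
    "---\n"
    "len wee zim (Tradition (US) ) says to yohan sen\n"
    "[[:Conversations will be recorded:]] Hi yohan\n"
    "\n"
    "(2021-07-04 11:47:14 PM BST)\n"
    "---\n"
    "Ke Cho Ki says to\n"
    "Hano Cho\n"
    "hello\n"
    "\n"
)

# Group #1 holds the consecutive non-empty lines after "says to".
matches = re.findall(r'says\s+to\s+((?:.+\n)+)', sample)

# Collapse newlines and repeated spaces into single spaces.
text = [" ".join(m.split()) for m in matches]
print(text)
```

If the messages are not separated by blank lines, the capture would run on into the following timestamp lines, so the input layout matters for this approach.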
CodePudding user response:
(?<=says\sto)[\s\S]*?(?=\(\d{4}-\d{2}-\d{2}\s(\d\d:){2}\d{2}\s\w{2}\s\w{3}\))
You need lookbehind and lookahead assertions for this: one lookbehind for 'says to', and one lookahead for the date pattern. In Python's re module only the lookbehind has to be fixed-width; the lookahead can be any regex.
- Syntax for lookbehind is
(?<=fixed_length_regex)
- Syntax for lookahead is
(?=regex)
So essentially what you are looking for would look something like this:
look-behind | pattern | look-ahead
________________|_________________________|__________________
| |
(?<=(says\sto)) | match_everything_here | (?=date_pattern)
This is equivalent to the regex at the top of this answer.
You can play around with the solution in regex101 here: https://regex101.com/r/rPFDo9/1/
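A rough sketch of using the lookaround idea with re.findall. Two things here are assumptions rather than part of the answer: the time group is written as non-capturing, (?:\d\d:){2}, because re.findall returns group contents when a pattern contains a capturing group, and a `---` alternative is added to the lookahead (borrowed from the question's own attempt) so that a final message with no timestamp after it is still matched. The sample is a shortened excerpt of the question's string.

```python
import re

# Shortened excerpt of the question's data (the last message has no timestamp after it).
sample = (
    "(2021-07-05 12:04:42 AM BST)\n"
    "---\n"
    "len wee zim (Tradition (US) ) says to yohan sen\n"
    "okay -5 / 0 .\n"
    "(2021-07-05 12:09:41 AM BST)\n"
    "---\n"
    "len wee zim (Tradition (Asia)) says\n"
    "to yohan sen\n"
    "yes -5 / 0 TN -- / 2.5\n"
    "---\n"
    "* * *\n"
)

# Date pattern from the answer, with a non-capturing group for findall.
timestamp = r'\(\d{4}-\d{2}-\d{2}\s(?:\d\d:){2}\d{2}\s\w{2}\s\w{3}\)'

# Fixed-width lookbehind, lazy body, lookahead for a timestamp or a "---" rule.
pattern = r'(?<=says\sto)[\s\S]*?(?=' + timestamp + r'|---)'

# Collapse newlines and surrounding whitespace in each match.
text = [" ".join(m.split()) for m in re.findall(pattern, sample)]
print(text)
```

The `---` alternative assumes a message body never contains three consecutive dashes; if it could, a stricter end marker would be needed.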
CodePudding user response:
With your shown samples, please try the following Python code, written and tested in Python 3.
import re
# Create the variable string with your values here; it is too long to repeat, so it is only mentioned as a comment.
var1 = re.findall(r'says\s+[^(]*', string, re.M)
The above creates a list named var1 whose elements end with newlines, so remove them with Python's strip function:
var1 = list(map(lambda s: s.strip(), var1))
Now print all the elements of the var1 list:
for element in var1:
    print(element)
Explanation: the regex is simple. re.findall is used with the regex says\s+[^(]* which matches from "says" followed by 1+ whitespaces up to just before the next occurrence of "(".
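As a variant of the same idea, and not the code from the answer itself, a capturing group can be used to leave "says to" out of each result. This is only a sketch on a shortened sample; it relies on the assumption that the message text contains no "(" and that every message is followed by a timestamp line starting with "(".

```python
import re

# Shortened sample; every message here is followed by a "(" timestamp line.
sample = (
    "len wee zim (Tradition (US) ) says to yohan sen\n"
    "okay -5 / 0 .\n"
    "(2021-07-04 11:47:14 PM BST)\n"
    "---\n"
    "Ke Cho Ki says to\n"
    "Hano Cho\n"
    "hello\n"
    "(2021-07-05 12:09:41 AM BST)\n"
)

# Capture everything after "says to" up to (not including) the next "(".
var1 = re.findall(r'says\s+to\s+([^(]*)', sample)

# Collapse newlines and trailing whitespace into single spaces.
var1 = [" ".join(s.split()) for s in var1]

for element in var1:
    print(element)
```

A message that itself contains "(" would be cut short by [^(]*, and a last message with no timestamp after it would run on to the end of the string.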