I have a string as:
string = """(2021-07-02 01:00:00 AM BST)
---
syl.hs has joined the conversation
(2021-07-02 01:00:23 AM BST)
---
e.wang
Good Morning
How're you?
(2021-07-02 01:05:11 AM BST)
---
wk.wang
Hi, I'm Good.
(2021-07-02 01:08:01 AM BST)
---
perter.derrek
we got the update on work.
It will get complete by next week.
(2021-07-15 08:59:41 PM BST)
---
ad.ft has left the conversation
---
* * *"""
I want to extract only the conversation text (the text between each name and the next timestamp). The expected output is:
comments = ["Good Morning How're you?", "Hi, I'm Good.", "we got the update on work. It will get complete by next week."]
What I have tried is:
comments=re.findall(r'---\s*\n(.(?:\n(?!(?:(\s\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s GMT\s*)\w \s*\n)?---).))',string)
CodePudding user response:
You could use a single capture group:
^---\s*\n(?!.* has (?:joined|left) the conversation|\* \* \*)\S.*((?:\n(?!\(\d|---).*)*)
The pattern matches:
^
  Start of a line (re.M makes ^ match at every line start)
---\s*\n
  Match ---, optional whitespace chars and a newline
(?!.* has (?:joined|left) the conversation|\* \* \*)
  Assert that the next line does not contain a " has joined the conversation" or " has left the conversation" part, and does not start with * * *
\S.*
  Match at least one non-whitespace char at the start of the line and the rest of the line (the name line)
(
  Capture group 1 (this will be returned by re.findall)
(?:\n(?!\(\d|---).*)*
  Match all following lines that do not start with ( and a digit, or with ---
)
  Close group 1
See a regex demo and a Python demo.
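The breakdown above can also be written directly into the pattern using Python's verbose mode. This is only a sketch of the same pattern, not a different technique: with re.X unescaped whitespace in the pattern is ignored, so each literal space is written as [ ]. The short sample string and the names sample and result are made up for illustration.

```python
import re

# Verbose (re.X) form of the pattern above. Literal spaces are written as [ ]
# because re.X ignores unescaped whitespace inside the pattern.
pattern = re.compile(r"""
    ^---\s*\n                    # a '---' line; re.M lets ^ match at each line start
    (?!                          # the following line must not ...
        .*[ ]has[ ](?:joined|left)[ ]the[ ]conversation   # ... be a join/leave notice
      | \*[ ]\*[ ]\*             # ... or start with '* * *'
    )
    \S.*                         # the name line (matched but not captured)
    (                            # group 1: the message lines
        (?:\n(?!\(\d|---).*)*    # lines that do not start with '(' + digit or '---'
    )
""", re.M | re.X)

# Hypothetical shortened sample, modelled on the string in the question
sample = """(2021-07-02 01:00:00 AM BST)
---
syl.hs has joined the conversation
(2021-07-02 01:00:23 AM BST)
---
e.wang
Good Morning
How're you?
---
* * *"""

# findall returns group 1 for each match; empty captures are dropped
result = [m.strip() for m in pattern.findall(sample) if m]
print(result)
```

The join notice and the closing * * * line are rejected by the negative lookahead, so only the lines under e.wang are returned.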
Example
import re

# "string" is the multi-line chat log from the question
pattern = r"^---\s*\n(?!.* has (?:joined|left) the conversation|\* \* \*)\S.*((?:\n(?!\(\d|---).*)*)"
result = [m.strip() for m in re.findall(pattern, string, re.M) if m]
print(result)
Output
["Good Morning\nHow're you?", "Hi, I'm Good.", 'we got the update on work. \nIt will get complete by next week.']
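The result above keeps the line breaks inside each comment, while the expected output in the question shows every comment on a single line. A minimal sketch of one way to get that form, assuming that joining the lines with a single space is what is wanted (the question does not say so explicitly):

```python
import re

string = """(2021-07-02 01:00:00 AM BST)
---
syl.hs has joined the conversation
(2021-07-02 01:00:23 AM BST)
---
e.wang
Good Morning
How're you?
(2021-07-02 01:05:11 AM BST)
---
wk.wang
Hi, I'm Good.
(2021-07-02 01:08:01 AM BST)
---
perter.derrek
we got the update on work.
It will get complete by next week.
(2021-07-15 08:59:41 PM BST)
---
ad.ft has left the conversation
---
* * *"""

pattern = r"^---\s*\n(?!.* has (?:joined|left) the conversation|\* \* \*)\S.*((?:\n(?!\(\d|---).*)*)"

# " ".join(m.split()) collapses every run of whitespace, including newlines,
# into one space (assumption: single-space joining is the desired format)
comments = [" ".join(m.split()) for m in re.findall(pattern, string, re.M) if m.strip()]
print(comments)
```

Because split() without arguments discards leading and trailing whitespace as well, no separate strip() call is needed.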
CodePudding user response:
I've assumed:
- The text of interest begins after a block of three lines: a line containing a timestamp, followed by the line "---", which may be padded to the right with spaces, followed by a line comprised of a string of letters containing one period which is neither at the beginning nor end of that string and that string may be padded on the right with spaces.
- The block of text of interest may contain blank lines, a blank line being a string that contains nothing other than spaces and a line terminator.
- The last line of the block of text of interest cannot be a blank line.
I believe the following regular expression (with the multiline (m) and case-insensitive (i) flags set) meets these requirements.
^\(\d{4}\-\d{2}\-\d{2} .*\) *\r?\n-{3} *\r?\n[a-z]+\.[a-z]+ *\r?\n((?:.*[^ (\n].*\r?\n| *\r?\n(?=(?: *\r?\n)*(?!\(\d{4}\-\d{2}\-\d{2} .*\)).*[^ (\n]))*)
The blocks of lines of interest are contained in capture group 1.
The elements of the expression are as follows.
^\(\d{4}\-\d{2}\-\d{2} .*\) *\r?\n # match timestamp line
-{3} *\r?\n # match 3-hyphen line
[a-z]+\.[a-z]+ *\r?\n # match name
( # begin capture group 1
(?: # begin non-capture group (a)
.*[^ (\n].*\r?\n # match a non-blank line
| # or
\ *\r?\n # match a blank line
(?= # begin a positive lookahead
(?: # begin non-capture group (b)
\ *\r?\n # match a blank line
)* # end non-capture group b and execute 0 or more times
(?! # begin a negative lookahead
\(\d{4}\-\d{2}\-\d{2} .*\) # match timestamp line
) # end negative lookahead
.*[^ (\n] # match a non-blank line
) # end positive lookahead
)* # end non-capture group a and execute 0 or more times
) # end capture group 1
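The three assumptions listed at the start of this answer can also be checked line by line, without one large expression. The sketch below is not the regular expression above but a swapped-in line-based approach under the same assumptions; the names TIMESTAMP, NAME, extract_comments and sample are hypothetical, and the sample is a shortened version of the question's string with one blank line added.

```python
import re

# Assumed line shapes, taken from the assumptions listed in this answer
TIMESTAMP = re.compile(r"^\(\d{4}-\d{2}-\d{2} .*\) *$")
NAME = re.compile(r"^[a-z]+\.[a-z]+ *$", re.I)


def extract_comments(text):
    """Return the text blocks that follow a timestamp / '---' / name header."""
    lines = text.splitlines()
    comments = []
    i = 0
    while i < len(lines):
        is_header = (
            i + 2 < len(lines)
            and TIMESTAMP.match(lines[i])
            and lines[i + 1].rstrip(" ") == "---"
            and NAME.match(lines[i + 2])
        )
        if not is_header:
            i += 1
            continue
        # Collect lines until the next timestamp line (or the end of the text)
        j = i + 3
        block = []
        while j < len(lines) and not TIMESTAMP.match(lines[j]):
            block.append(lines[j])
            j += 1
        # The last line of a block cannot be blank, so drop trailing blanks
        while block and not block[-1].strip():
            block.pop()
        if block:
            comments.append("\n".join(block))
        i = j  # the timestamp that ended the block may start the next header
    return comments


sample = """(2021-07-02 01:00:00 AM BST)
---
syl.hs has joined the conversation
(2021-07-02 01:00:23 AM BST)
---
e.wang
Good Morning

How're you?
(2021-07-02 01:05:11 AM BST)
---
wk.wang
Hi, I'm Good.
(2021-07-15 08:59:41 PM BST)
---
ad.ft has left the conversation
---
* * *"""

result = extract_comments(sample)
print(result)
```

The join and leave notices are skipped because the line after "---" is not a bare name, and the blank line inside the first block is kept because it is followed by more text.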