extract substring from large string-CodePudding

I have a string as:

string="(2021-07-02 01:00:00 AM BST)  
---  
syl.hs has joined the conversation  
  
  

(2021-07-02 01:00:23 AM BST)  
---  
e.wang  
Good Morning
How're you?
  
  
  

(2021-07-02 01:05:11 AM BST)  
---  
wk.wang  
Hi, I'm Good.  
  
  

(2021-07-02 01:08:01 AM BST)  
---  
perter.derrek   
we got the update on work. 
It will get complete by next week.

(2021-07-15 08:59:41 PM BST)  
---  
ad.ft has left the conversation  
  
  
  
  
---  
  
* * *"

I want to extract the conversation text only (text in between name and timestamp) expected output as:

comments=['Good Morning How're you?','Hi, I'm Good.','we got the update on work.It will get complete by next week.']

What I have tried is:

comments=re.findall(r'---\s*\n(.(?:\n(?!(?:(\s\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s GMT\s*)\w \s*\n)?---).))',string)

CodePudding user response：

You could use a single capture group:

^---\s*\n(?!.* has (?:joined|left) the conversation|\* \* \*)\S.*((?:\n(?!\(\d|---).*)*)

The pattern matches:

^ Start of string
---\s*\n Match --- optional whitespace chars and a newline
(?!.* has (?:joined|left) the conversation|\* \* \*) Assert that the line does not contain a has joined or has left the conversation part, or contains * * *
\S.* Match at least a non whitespace char at the start of the line and the rest of the line
( Capture group 1 (this will be returned by re.findall)
- (?:\n(?!\(\d|---).*)* Match all lines the do not start with ( and a digit or --
) Close group 1

See a regex demo and a Python demo.

Example

pattern = r"^---\s*\n(?!.* has (?:joined|left) the conversation|\* \* \*)\S.*((?:\n(?!\(\d|---).*)*)"
result = [m.strip() for m in re.findall(pattern, s, re.M) if m]
print(result)

Output

["Good Morning\nHow're you?", "Hi, I'm Good.", 'we got the update on work. \nIt will get complete by next week.']

CodePudding user response：

I've assumed:

The text of interest begins after a block of three lines: a line containing a timestamp, followed by the line "---", which may be padded to the right with spaces, followed by a line comprised of a string of letters containing one period which is neither at the beginning nor end of that string and that string may be padded on the right with spaces.
The block of text of interest may contain blank lines, a blank line being a string that contains nothing other than spaces and a line terminator.
The last line of the block of text of interest cannot be a blank line.

I believe the following regular expression (with multiline (m) and case-indifferent (i) flags set) meets these requirements.

^\(\d{4}\-\d{2}\-\d{2} .*\) *\r?\n-{3} *\r?\n[a-z] \.[a-z]  *\r?\n((?:.*[^ (\n].*\r?\n| *\r?\n(?=(?: *\r?\n)*(?!\(\d{4}\-\d{2}\-\d{2} .*\)).*[^ (\n]))*)

The blocks of lines of interest are contained in capture group 1.

Start your engine!

The elements of the expression are as follows.

^\(\d{4}\-\d{2}\-\d{2} .*\) *\r?\n  # match timestamp line
-{3} *\r?\n                         # match 3-hyphen line
[a-z] \.[a-z]  *\r?\n               # match name
(                                   # begin capture group 1
  (?:                               # begin non-capture group (a)
    .*[^ (\n].*\r?\n                # match a non-blank line
    |                               # or
    \ *\r?\n                        # match a blank line
    (?=                             # begin a positive lookahead
      (?:                           # begin non-capture group (b)
        \ *\r?\n                    # match a blank line
      )*                            # end non-capture group b and execute 0  times
      (?!                           # begin a negative lookahead
        \(\d{4}\-\d{2}\-\d{2} .*\)  # match timestamp line
      )                             # end negative lookahead
      .*[^ (\n]                     # march a non-blank line
    )                               # end positive lookahead
  )*                                # end non-capture group a and execute 0  times
)                                   # end capture group 1