extract substring from large string

Time:11-11

I have a string as:

string="(2021-07-02 01:00:00 AM BST)  
---  
syl.hs has joined the conversation  
  
  

(2021-07-02 01:00:23 AM BST)  
---  
e.wang  
Good Morning
How're you?
  
  
  

(2021-07-02 01:05:11 AM BST)  
---  
wk.wang  
Hi, I'm Good.  
  
  

(2021-07-02 01:08:01 AM BST)  
---  
perter.derrek   
we got the update on work. 
It will get complete by next week.

(2021-07-15 08:59:41 PM BST)  
---  
ad.ft has left the conversation  
  
  
  
  
---  
  
* * *"

I want to extract only the conversation text (the text between each name and the next timestamp). Expected output:

comments=["Good Morning How're you?", "Hi, I'm Good.", "we got the update on work.It will get complete by next week."]

What I have tried is:

comments=re.findall(r'---\s*\n(.(?:\n(?!(?:(\s\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s GMT\s*)\w \s*\n)?---).))',string)

CodePudding user response:

You could use a single capture group:

^---\s*\n(?!.* has (?:joined|left) the conversation|\* \* \*)\S.*((?:\n(?!\(\d|---).*)*)

The pattern matches:

  • ^ Start of a line (the pattern is used with re.M)
  • ---\s*\n Match ---, optional whitespace chars and a newline
  • (?!.* has (?:joined|left) the conversation|\* \* \*) Assert that the line does not contain " has joined the conversation" or " has left the conversation", and does not start with * * *
  • \S.* Match a non-whitespace char at the start of the line, then the rest of the line (this is the name line)
  • ( Capture group 1 (this will be returned by re.findall)
    • (?:\n(?!\(\d|---).*)* Match all following lines that do not start with ( and a digit, or with ---
  • ) Close group 1

See a regex demo and a Python demo.

Example

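import re

# s holds the chat transcript from the question (named `string` there)
pattern = r"^---\s*\n(?!.* has (?:joined|left) the conversation|\* \* \*)\S.*((?:\n(?!\(\d|---).*)*)"
# re.M lets ^ match at the start of every line; empty captures are skipped
result = [m.strip() for m in re.findall(pattern, s, re.M) if m]
print(result)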

Output

["Good Morning\nHow're you?", "Hi, I'm Good.", 'we got the update on work. \nIt will get complete by next week.']
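The expected output in the question has each comment on a single line, while the result above keeps the newlines. A minimal sketch of one way to close that gap, assuming the goal is to join the lines of each captured block with a single space: the helper name extract_comments and the shortened sample string are illustrative only and do not come from the question or the answer.

```python
import re

# Pattern from the answer above; group 1 holds the lines after the name line.
PATTERN = re.compile(
    r"^---\s*\n(?!.* has (?:joined|left) the conversation|\* \* \*)\S.*((?:\n(?!\(\d|---).*)*)",
    re.M,
)


def extract_comments(text):
    """Return each message as a single line (hypothetical helper)."""
    comments = []
    for block in PATTERN.findall(text):
        # Drop blank lines, trim each remaining line, then join with a space.
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if lines:
            comments.append(" ".join(lines))
    return comments


# Shortened sample in the same shape as the string in the question.
sample = (
    "(2021-07-02 01:00:00 AM BST)  \n"
    "---  \n"
    "syl.hs has joined the conversation  \n"
    "  \n"
    "\n"
    "(2021-07-02 01:00:23 AM BST)  \n"
    "---  \n"
    "e.wang  \n"
    "Good Morning\n"
    "How're you?\n"
    "  \n"
    "\n"
    "(2021-07-02 01:05:11 AM BST)  \n"
    "---  \n"
    "wk.wang  \n"
    "Hi, I'm Good.  \n"
)

print(extract_comments(sample))
```

Joining with a space is a design choice; joining with an empty string instead would run adjacent lines together, as in the third comment of the question's expected output.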

CodePudding user response:

I've assumed:

  • The text of interest begins after a block of three lines: a timestamp line; the line "---", possibly padded on the right with spaces; and a name line, which is a string of letters containing exactly one period that is neither the first nor the last character, also possibly padded on the right with spaces.
  • The block of text of interest may contain blank lines, a blank line being a string that contains nothing other than spaces and a line terminator.
  • The last line of the block of text of interest cannot be a blank line.

I believe the following regular expression (with the multiline (m) and case-insensitive (i) flags set) meets these requirements.

^\(\d{4}\-\d{2}\-\d{2} .*\) *\r?\n-{3} *\r?\n[a-z]+\.[a-z]+ *\r?\n((?:.*[^ (\n].*\r?\n| *\r?\n(?=(?: *\r?\n)*(?!\(\d{4}\-\d{2}\-\d{2} .*\)).*[^ (\n]))*)

The blocks of lines of interest are contained in capture group 1.

Start your engine!

The elements of the expression are as follows.

^\(\d{4}\-\d{2}\-\d{2} .*\) *\r?\n  # match timestamp line
-{3} *\r?\n                         # match 3-hyphen line
[a-z]+\.[a-z]+ *\r?\n               # match name
(                                   # begin capture group 1
  (?:                               # begin non-capture group (a)
    .*[^ (\n].*\r?\n                # match a non-blank line
    |                               # or
    \ *\r?\n                        # match a blank line
    (?=                             # begin a positive lookahead
      (?:                           # begin non-capture group (b)
        \ *\r?\n                    # match a blank line
      )*                            # end non-capture group b and execute 0+ times
      (?!                           # begin a negative lookahead
        \(\d{4}\-\d{2}\-\d{2} .*\)  # match timestamp line
      )                             # end negative lookahead
      .*[^ (\n]                     # match a non-blank line
    )                               # end positive lookahead
  )*                                # end non-capture group a and execute 0+ times
)                                   # end capture group 1
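
The expression above can be tried in Python roughly as follows. This is a sketch under stated assumptions, not the answerer's own code: the name-line part is written as [a-z]+\.[a-z]+ (the "+" quantifiers are inferred from the stated assumption about names), and the sample string is a made-up fragment in the shape of the question's data, with a blank line inside the message to exercise the lookahead.

```python
import re

# Expression from this answer, split across string literals for readability.
# Assumption: the name line uses "+" quantifiers, i.e. [a-z]+\.[a-z]+
PATTERN = re.compile(
    r"^\(\d{4}\-\d{2}\-\d{2} .*\) *\r?\n"  # timestamp line
    r"-{3} *\r?\n"                          # 3-hyphen line
    r"[a-z]+\.[a-z]+ *\r?\n"                # name line
    r"((?:.*[^ (\n].*\r?\n"                 # group 1: a non-blank line, or
    r"| *\r?\n(?=(?: *\r?\n)*(?!\(\d{4}\-\d{2}\-\d{2} .*\)).*[^ (\n]))*)",
    re.M | re.I,  # multiline and case-insensitive, as the answer requires
)

# Illustrative sample: one message containing a blank line, then a "left" event.
sample = (
    "(2021-07-02 01:00:23 AM BST)  \n"
    "---  \n"
    "e.wang  \n"
    "Good Morning\n"
    "\n"
    "How're you?\n"
    "  \n"
    "\n"
    "(2021-07-15 08:59:41 PM BST)  \n"
    "---  \n"
    "ad.ft has left the conversation  \n"
)

# findall returns capture group 1 for every match; strip the trailing newline.
blocks = [m.strip() for m in PATTERN.findall(sample)]
print(blocks)
```

The blank line inside the message is kept because a non-blank, non-timestamp line follows it, while the blank lines before the next timestamp are not captured. One caveat, as far as I can tell: the first alternative also accepts a timestamp line, so this relies on every message being followed by at least one blank line before the next timestamp, which holds for the data in the question.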