Regex to match a paragraph between two substrings

I have a string that looks like this:

string=""
( 2021-07-10 01:24:55 PM GMT )TEST  
---  
Badminton is a racquet sport played using racquets to hit a shuttlecock across
a net. Although it may be played with larger teams, the most common forms of
the game are "singles" (with one player per side) and "doubles" (with two
players per side).  
  
  

  

( 2021-07-10 01:27:55 PM GMT )PATRICKWARR  
---  
Good morning, I am doing well. And you?  
  
  

  
  
  
---  
  
  
  
  
---  
  
* * *""

I am trying to split the string into parts like this:

text=['Badminton is a racquet sport played using racquets to hit a shuttlecock across a net. Although it may be played with larger teams, the most common forms of the game are "singles" (with one player per side) and "doubles" (with two players per side).','Good morning, I am doing well. And you?']

What I have tried:

text=re.findall(r'\( \d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2} PM GMT \)\w   [\S\n]---  .*',string)

I can't figure out how to extract multiple lines.

CodePudding user response：

You can use

(?m)^\(\s*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s+GMT\s*\)\w+\s*\n---\s*\n(.*(?:\n(?!(?:\(\s*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s+GMT\s*\)\w+\s*\n)?---).*)*)

See the regex demo. Details:

^ - start of line
{left_rx} - left boundary
--- - three hyphens
\s*\n - zero or more whitespaces and then an LF char
(.*(?:\n(?!(?:{left_rx})?---).*)*) - Group 1:
- .* - zero or more chars other than line break chars as many as possible
- (?:\n(?!(?:{left_rx})?---).*)* - zero or more lines (possibly empty, since .* can match nothing) that do not start with the (optional) left boundary pattern followed by ---
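
To illustrate why the repeated \n(?!...).* part is needed, here is a minimal sketch on a made-up sample. The sample string and the variable names are assumptions for illustration, and the left boundary is left out to keep the pattern short:

```python
import re

# Hypothetical sample: two blocks, each introduced by a '---' line.
sample = "---\nline one\nline two\n---\nline three"

# '.*' alone stops at the first line break, so only one line is captured.
first_line_only = re.findall(r"^---\n(.*)", sample, re.M)

# Repeating '\n(?!---).*' keeps consuming lines until one starts with '---'.
whole_blocks = re.findall(r"^---\n(.*(?:\n(?!---).*)*)", sample, re.M)

print(first_line_only)
print(whole_blocks)
```

The negative lookahead is checked once per line, right after each line break, which is what lets the group span several lines without running into the next block.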

The boundary pattern defined in left_rx is \(\s*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s+GMT\s*\)\w+\s*\n. It is basically the same as the original: I used \s* to match zero or more whitespace characters and \s+ to match one or more whitespace characters between "words".

See the Python demo:

import re
text = '''string=""\n( 2021-07-10 01:24:55 PM GMT )TEST  \n---  \nBadminton is a racquet sport played using racquets to hit a shuttlecock across\na net. Although it may be played with larger teams, the most common forms of\nthe game are "singles" (with one player per side) and "doubles" (with two\nplayers per side).  \n  \n  \n\n  \n\n( 2021-07-10 01:27:55 PM GMT )PATRICKWARR  \n---  \nGood morning, I am doing well. And you?  \n  \n  \n\n  \n  \n  \n---  \n  \n  \n  \n  \n---  \n  \n* * *""'''
left_rx = r"\(\s*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s+GMT\s*\)\w+\s*\n"
rx = re.compile(fr"^{left_rx}---\s*\n(.*(?:\n(?!(?:{left_rx})?---).*)*)", re.M)
print([x.strip().replace('\n', ' ') for x in rx.findall(text)])

Output:

['Badminton is a racquet sport played using racquets to hit a shuttlecock across a net. Although it may be played with larger teams, the most common forms of the game are "singles" (with one player per side) and "doubles" (with two players per side).', 'Good morning, I am doing well. And you?']
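
If the timestamp and the user name are needed as well, the same idea can be extended with named groups. This is only a sketch under assumptions: the sample is a shortened, made-up string in the question's layout, and the names stamp_rx, header, header_plain and the group names are illustrative, not part of the pattern above.

```python
import re

# Shortened, hypothetical sample in the same layout as the question's string.
sample = (
    "( 2021-07-10 01:24:55 PM GMT )TEST  \n"
    "---  \n"
    "First line\n"
    "second line.  \n"
    "  \n"
    "( 2021-07-10 01:27:55 PM GMT )PATRICKWARR  \n"
    "---  \n"
    "Good morning.  \n"
    "  \n"
    "---  \n"
    "* * *"
)

# Timestamp part, shared by both header variants.
stamp_rx = r"\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s+GMT"
# Header line with named groups for the timestamp and the user name.
header = rf"\(\s*(?P<stamp>{stamp_rx})\s*\)(?P<user>\w+)[^\S\n]*\n"
# Same header shape without groups, used inside the lookahead.
header_plain = rf"\(\s*{stamp_rx}\s*\)\w+[^\S\n]*\n"

rx = re.compile(
    rf"^{header}---[^\S\n]*\n(?P<body>.*(?:\n(?!(?:{header_plain})?---).*)*)",
    re.M,
)

# Collapse every whitespace run in the body (line breaks included) to one space.
messages = [
    (m.group("user"), m.group("stamp"), " ".join(m.group("body").split()))
    for m in rx.finditer(sample)
]
print(messages)
```

Here [^\S\n]* matches horizontal whitespace only, so the trailing spaces after a header or after --- cannot swallow a line break.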

CodePudding user response：

One of the approaches:

import re
# Replace all \n with ''
string = string.replace('\n', '')

# Replace the date string '( 2021-07-10 01:27:55 PM GMT )PATRICKWARR ' and string like '* * *' with ''
string = re.sub(r"\(\s*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2} [AP]M GMT\s*\)\w+|\*+", '', string)

data = string.split('---')
data = [item.strip() for item in data if item.strip()]
print(data)

Output:

['Badminton is a racquet sport played using racquets to hit a shuttlecock acrossa net. Although it may be played with larger teams, the most common forms ofthe game are "singles" (with one player per side) and "doubles" (with twoplayers per side).', 'Good morning, I am doing well. And you?']
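
Note that removing the line breaks outright fuses the words at line ends ("acrossa", "ofthe", "twoplayers"). A possible variant, sketched here on a shortened, made-up sample in the question's layout, keeps the line breaks until after the split and then collapses each whitespace run to a single space. The sample string and the name cleaned are assumptions for illustration.

```python
import re

# Shortened, hypothetical sample in the same layout as the question's string.
string = (
    "( 2021-07-10 01:24:55 PM GMT )TEST  \n"
    "---  \n"
    "hit a shuttlecock across\n"
    "a net.  \n"
    "  \n"
    "( 2021-07-10 01:27:55 PM GMT )PATRICKWARR  \n"
    "---  \n"
    "Good morning, I am doing well. And you?  \n"
    "  \n"
    "---  \n"
    "* * *"
)

# Drop the header lines and any runs of asterisks.
cleaned = re.sub(
    r"\(\s*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2} [AP]M GMT\s*\)\w+|\*+", "", string
)

# Split on '---', then collapse whitespace runs (line breaks included) to one space.
data = [" ".join(part.split()) for part in cleaned.split("---")]
# Discard the parts that were only whitespace.
data = [part for part in data if part]
print(data)
```

Like the approach above, this assumes that neither --- nor asterisks occur inside the message text itself.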