Home > Back-end >  Extracting texts using regex
Extracting texts using regex

Time:10-27

I have the below string in a file. I am trying out regex to extract the paragraphs after "---" which is not shows as a text in this editor. The image below should give you an understanding of text.

( 2021-07-10 01:24:55 PM GMT )STEMAILTE

Badminton is a racquet sport played using racquets to hit a shuttlecock across a net. Although it may be played with larger teams, the most common forms of the game are "singles" (with one player per side) and "doubles" (with two players per side).

( 2021-07-10 01:27:55 PM GMT )ARAMASU

Both the academies run a residential training program for upcoming and talented footballers and Boxers. The Academies are functioning in Sarusajai Sports complex.

Have attached the image below -

captured_img

So far I have tried re.findall(r'([(){}[]][^\S]\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [a-zA-Z]{2} [a-zA-Z]{3}[^\S][(){}[]][\w.] )(\n- )((\n. )) ',text) which just gives me the datetime and text after that.

I am trying to extract the three pharagraphs after the "---" in each groups from the above image.

CodePudding user response:

Why do you wanna use such a complex regex for this kinda simple task? Is there any variable separation needed or does your text have always have the same structure? If so, why not using something like this:

paragraph = text.splitlines()
paragraph = [line for line in text.splitlines() if not (line.startswith("---") or line.startswith("(") or not line)]
print(paragraph)

assuming your text is stored in a variable called text.

CodePudding user response:

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

  • Related