Regex with Transcript data (Theatre)

Time:11-17

**Jacke:** Some text... is here?
**Hipster:** Some text... is here?
**Hipster** (ruft): Some text... is here?
**Aluhut** (brüllt hysterisch): Some text... is here?

I am trying to use a regex to get only the spoken text, without the leading speaker name, so only this:

Some text... is here?
Some text... is here?
Some text... is here?
Some text... is here?

Is that possible? This is my regex so far:

([*]{2}[a-zA-Zäüö:]+[*]{2}[:]*.)([^\n]*)

UPDATE: Is there also a way to filter out annotations like these?

*test* Hello it's me
(test) Hello it's me
(*test*) Hello it's me *test*
*(test)* Hello it's me **test**

The result for each of these lines should be:

Hello it's me
Hello it's me
Hello it's me
Hello it's me

CodePudding user response:

You could optionally match the trailing part in parentheses, and drop the first capture group so that the whole first part is matched without being captured.

Capture the part that follows in a capture group.

[*]{2}[a-zA-Zäüö:]+[*]{2}:*(?:\s*\([^()]*\):)?\s+(\S.*)

The pattern matches:

  • [*]{2}[a-zA-Zäüö:]+ Match ** and 1+ occurrences of any char listed in the character class
  • [*]{2}:* Match ** and optional colons
  • (?:\s*\([^()]*\):)? Optionally match whitespace chars followed by a part in parentheses and :
  • \s+ Match 1+ whitespace chars
  • (\S.*) Capture group 1, match a non-whitespace char and the rest of the line
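
The breakdown can also be kept next to the pattern itself. The following is a minimal sketch, assuming one-or-more quantifiers after the name character class and before the spoken text; the name SPEAKER_LINE is only illustrative. It uses Python's re.VERBOSE flag so that each part carries its own comment.

```python
import re

# Illustrative name; the pattern is written in verbose mode so that
# whitespace inside it is ignored and each part can be commented.
SPEAKER_LINE = re.compile(
    r"""
    [*]{2}[a-zA-Zäüö:]+     # opening ** and the speaker name (may include a colon)
    [*]{2}:*                # closing ** and optional colons
    (?:\s*\([^()]*\):)?     # optional stage direction in parentheses, then a colon
    \s+                     # whitespace before the spoken text
    (\S.*)                  # group 1: the spoken text up to the end of the line
    """,
    re.VERBOSE,
)

match = SPEAKER_LINE.match("**Hipster** (ruft): Some text... is here?")
print(match.group(1))
```

Note that the character class only covers ASCII letters plus ä, ü and ö, so a speaker name containing ß, uppercase umlauts, digits or spaces would not match without extending it.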


Example code using re.findall which gives the value of the capture group:

import re
 
regex = r"[*]{2}[a-zA-Zäüö:]+[*]{2}:*(?:\s*\([^()]*\):)?\s+(\S.*)"
 
s = ("**Jacke:** Some text... is here?\n"
    "**Hipster:** Some text... is here?\n"
    "**Hipster** (ruft): Some text... is here?\n"
    "**Aluhut** (brüllt hysterisch): Some text... is here?")
 
print(re.findall(regex, s))

Output

[
'Some text... is here?',
'Some text... is here?',
'Some text... is here?',
'Some text... is here?'
]
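
The answer above does not cover the UPDATE in the question. One possible approach, as a rough sketch only, is to delete every annotation with re.sub and trim what is left. The names ANNOTATION and strip_annotations are hypothetical, and the sketch assumes that annotations never nest, never span lines, and that the spoken text itself contains no parentheses or paired asterisks.

```python
import re

# Hypothetical helper, not part of the answer above: removes annotations
# wrapped in asterisks and/or parentheses, then trims leftover whitespace.
ANNOTATION = re.compile(
    r"\*+\([^()\n]*\)\*+"   # asterisks around parentheses, e.g. *(test)*
    r"|\([^()\n]*\)"        # parentheses, e.g. (test) or (*test*)
    r"|\*+[^*\n]+\*+"       # asterisks only, e.g. *test* or **test**
)

def strip_annotations(line):
    """Return the line without annotations and without surrounding spaces."""
    return ANNOTATION.sub("", line).strip()

lines = [
    "*test* Hello it's me",
    "(test) Hello it's me",
    "(*test*) Hello it's me *test*",
    "*(test)* Hello it's me **test**",
]

print([strip_annotations(line) for line in lines])
```

The order of the alternatives matters: the combined form is tried first so that `*(test)*` is removed as a whole instead of leaving stray asterisks behind. An annotation in the middle of a line would leave two adjacent spaces, which would need a further clean-up step.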