Regex to read formatted dialogue from play script

I wrote a script for a play as a .txt file, and I am trying to parse it to get a list of the lines each character says. Here is a sample of the layout:

GARVICK
Not gonna happen. We're here to help, not make things worse!.
(laughs)
This'll be a good one. Hot shot eh, leaves the office to come to the
underworld for some action. What's your problem!

LEON
Those corporate executives are crucial to our operations in this here
underworld. They're gonna get choked. We gotta stop it! I need someone
to get in and out real fast.  And I know you're the guy for the job!  
You gotta be good, I know your style, a little harsh but that's fine!

GARVICK
Tell me more.

When I copy this regex into pythex.org:

([a-zA-Z\s]+)\n(.+)\n

I get the following:

Match 1
1.  GARVICK #(this is fine)
2.  Not gonna happen. We're here to help, not make things worse!.
Match 2
1.  leaves the office to come to the
2.  underworld for some action. What's your problem!
Match 3
1.  LEON Those corporate executives are crucial to our operations in this here
2.  underworld. They're gonna get choked. We gotta stop it! I need someone
Match 4
1.  
2.  You gotta be good, I know your style, a little harsh but that's fine!

However, ideally I would want something like the output below. Is there a way to tweak my regex to do this?

Match 1 
1. GARVICK
2. Not gonna happen. We're here to help, not make things worse!.

Match 2
1. GARVICK
2. This'll be a good one.
3. Hot shot eh, leaves the office to come to the underworld for some action. 
4. What's your problem!

Match 3
1. LEON
2. Those corporate executives are crucial to our operations in this here underworld. 
3. They're gonna get choked. 
4. We gotta stop it! 
5. I need someone to get in and out real fast.
6. And I know you're the guy for the job!
7. You gotta be good, I know your style, a little harsh but that's fine!

Match 4
1.  GARVICK
2.  Tell me more.

CodePudding user response:

Regex is powerful, and incredibly useful when you don't have a (firm) idea of the elements you are working with. Otherwise, it adds significant complexity and, in my opinion, should never be the first tool you reach for.

In this case you have a fixed layout, and a fixed list you want to use to break your text apart.

The example below merges dialogue that has been wrapped across multiple lines back into whole sentences. It also writes each character's dialogue to an individual text file. It requires Python 3.8 or later for the walrus operator (:=); on older versions, assign current_line = line.strip() on its own line before the if statement.

# Set up storage variables
actors = ['GARVICK', 'LEON']
dialogue_collated = []
current_text = ""
current_actor = ""

with open('dialogue.txt', 'r') as f:
    for line in f:
        if (current_line := line.strip()) in actors:
            # A new speaker starts: store the previous speaker's dialogue
            dialogue_collated.append((current_actor, current_text))
            current_actor = current_line
            current_text = ""
            continue
        if not current_line:
            # Skip the blank lines that separate speeches
            continue
        # accumulate the current set of dialogue
        if current_line[-1:] in ("!", ".", ")"):
            current_text += current_line + "\n"
        else:
            current_text += current_line + " "

# Store the last speaker's dialogue, which no later speaker line flushes
dialogue_collated.append((current_actor, current_text))

# Remove the first empty entry
dialogue_collated.pop(0)

print(dialogue_collated)

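for entry in dialogue_collated:
    print(entry[0] + '\n' + entry[1])
    # 'a' appends, so a character who speaks more than once keeps every speech
    with open(entry[0].replace(' ', '') + '.txt', 'a') as fo:
        fo.write(entry[1])

The output is a list of tuples, where each tuple holds a speaker and that speaker's block of dialogue. Each block is then written to a text file named after the speaker. The important part is the 'a' mode, which appends to an existing file rather than overwriting it; opening with 'w' or 'w+' would truncate the file on every write and leave only the character's last speech. Because the files are appended to, delete them before re-running the script to avoid duplicated text.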
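If you also want each speech broken into sentences, as in the desired output in the question, something like the following might work. This is only a sketch under a few assumptions of mine: the function names parse_script and split_sentences are hypothetical, a line wrapped in parentheses is treated as a stage direction that ends the current speech, and a sentence is taken to end at ".", "!" or "?" followed by whitespace. It works on a string rather than a file so it can be tried without dialogue.txt.

```python
import re

SAMPLE = """GARVICK
Not gonna happen. We're here to help, not make things worse!.
(laughs)
This'll be a good one. Hot shot eh, leaves the office to come to the
underworld for some action. What's your problem!

LEON
They're gonna get choked. We gotta stop it!

GARVICK
Tell me more.
"""


def split_sentences(lines):
    """Join wrapped lines, then split after ., ! or ? followed by whitespace."""
    joined = " ".join(" ".join(lines).split())  # also collapses double spaces
    return re.split(r"(?<=[.!?])\s+", joined)


def parse_script(text, actors):
    """Return a list of (speaker, [sentences]) tuples, one per speech."""
    speeches = []
    speaker, lines = None, []

    for raw in text.splitlines():
        line = raw.strip()
        is_speaker = line in actors
        # Assumption: a fully parenthesised line is a stage direction
        is_direction = line.startswith("(") and line.endswith(")")

        if is_speaker or is_direction:
            # Close the speech collected so far, if there is one
            if speaker is not None and lines:
                speeches.append((speaker, split_sentences(lines)))
            lines = []
            if is_speaker:
                speaker = line
        elif line:
            lines.append(line)

    # The last speech is not followed by a speaker line, so close it here
    if speaker is not None and lines:
        speeches.append((speaker, split_sentences(lines)))
    return speeches


for name, sentences in parse_script(SAMPLE, {"GARVICK", "LEON"}):
    print(name, sentences)
```

Note that the sentence split is deliberately naive: abbreviations such as "Mr. Smith" would be split in the wrong place, so a real script may need a more careful rule.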