Regex: How to extract dialogue tags from fiction, with speaker information

Time:11-25

Totally stumped on this. I need help extracting dialogue from a story so I can hand it off for narration.

I have a big chunk of text (a novel), and I want to extract all the dialogue from it in a format I can pipe into a spreadsheet.

I also want the speaker information, if it exists. So, given a string like:

'"I'm really hungry," she said.'

I would like the values returned as:

[ "I'm really hungry", "she said" ]

If there is no speaker information, as in this example:

"I'm not hungry."

the result would just be:

["I'm not hungry."]

Is this madness? Is it even possible? I have fooled around with this regex (am not a regex guru, knowing only enough to be dangerous):

"([^"]*)"

Which seems to get the dialogue, but doesn't get the speaker info. Any advice on how to get the speaker info as well would be greatly appreciated. I've been wrestling with this for a while now.

Maybe a better approach would be to get the dialogue in one field, and the entire paragraph it is found in as the second field. That could also work, but I have no idea where to start with this.

Basically, I want to put these all into a spreadsheet so I can hand them off to a narrator with enough context that they know whose dialogue is whose in the story.

Any help is greatly appreciated!

CodePudding user response:

It definitely is possible.

Look at this regex: ^.*?'?(?P<line>\".*\")(?P<actor>[^'\n]*)'?.*?$

demo here: https://regex101.com/r/UCRZwY/5

It treats the outer single quotes as optional. The double-quoted dialogue is stored in the named group 'line', and whatever follows it (up to the next single quote or the end of the line) is stored in 'actor'. These are of course just names I've given them; feel free to change them.

Note: updated to also handle dialogue that appears as part of a regular sentence; see the example in the demo.
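
For anyone who wants to try this outside regex101, here is a minimal Python sketch using the pattern above unchanged. The function name extract_dialogue and the clean-up steps (stripping the surrounding double quotes, the trailing comma, and the final period of the speaker part) are my own assumptions about the desired output, not part of the original regex. re.MULTILINE is assumed so that ^ and $ apply to each line of the text.

```python
import re

# Pattern taken verbatim from the answer above.
# re.MULTILINE lets ^ and $ anchor at every line, not just the whole text.
PATTERN = re.compile(
    r"^.*?'?(?P<line>\".*\")(?P<actor>[^'\n]*)'?.*?$",
    re.MULTILINE,
)


def extract_dialogue(text):
    """Return one row per matching line: [dialogue] or [dialogue, speaker].

    Hypothetical helper: the trimming rules are assumptions chosen to
    reproduce the output format asked for in the question.
    """
    rows = []
    for match in PATTERN.finditer(text):
        # Drop the surrounding double quotes and a trailing comma, if any.
        line = match.group("line").strip('"').rstrip(",")
        # Drop surrounding whitespace and the sentence-final period.
        actor = match.group("actor").strip().rstrip(".")
        rows.append([line, actor] if actor else [line])
    return rows


sample = '"I\'m really hungry," she said.\n"I\'m not hungry."\nNo dialogue on this line.'
for row in extract_dialogue(sample):
    print(row)
# ["I'm really hungry", 'she said']
# ["I'm not hungry."]
```

Two limitations follow directly from the pattern. The 'actor' group stops at the first apostrophe, so a speaker such as O'Brien would be cut short. And because .* is greedy, a line with several quoted passages is captured as one 'line' running from the first double quote to the last.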
