Why am I getting a dash when I try to use Regex on a text?

After being helped, I finally managed to apply Regex on a text to try to find some patterns.

My project consists of finding dialogues in a text written in Portuguese. In Portuguese, dialogues can be written in several ways: between dashes (- ele disse que sim-), with a dash starting the dialogue (- ele disse que sim), and in between quotation marks ("eu acho que sim").

However, as words in Portuguese can also contain dashes, such as in "viu-me" or "disse-lhe", I had to write code that takes all of this into account.

The problem I am having is that I am getting dashes when I search for the pattern in a text.

Here is my code:

import re

text = '''
"Para muitos é mais do que isso."

Eles chegarem em casa são e salvos

Viu-se que eles não estavam lá
'''

for d in re.finditer(r'(".+")|(^\s?-\s.+\s|-)', text, re.MULTILINE):
    print(d.group())

Here is the current output:

"Para muitos é mais do que isso."
-

The code finds the dialogue in quotation marks, but it also prints a dash. It seems the regex recognizes that "Viu-se" is not a dialogue, only a word with an embedded dash, yet it still returns the dash itself.

The desired output:

"Para muitos é mais do que isso."

CodePudding user response：

Just put a $ sign at the end of your regex, so that the lone dash only matches at the end of a line.

r'(".+")|(^\s?-\s.+\s|-$)'
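
As a quick check of this suggestion, here is a minimal sketch that runs the pattern against the sample text from the question. It assumes the wildcard in the pattern is `.+`; the variable names `pattern` and `found` are only illustrative.

```python
import re

text = '''
"Para muitos é mais do que isso."

Eles chegarem em casa são e salvos

Viu-se que eles não estavam lá
'''

# With "-$" the lone-dash branch only matches a dash at the end of a line,
# so the hyphen inside "Viu-se" is no longer reported.
pattern = r'(".+")|(^\s?-\s.+\s|-$)'
found = [m.group() for m in re.finditer(pattern, text, re.MULTILINE)]
print(found)
```

Note that the second branch would still match any bare dash that happens to end a line, so this is a narrow fix rather than a general one.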

CodePudding user response：

The reason is that the trailing |- in (^\s?-\s.+\s|-) is incorrect. Alternation has the lowest precedence, so the group tells the regex to match either ^\s?-\s.+\s OR a single dash/hyphen. That second branch matches the hyphen in Viu-se, since |- has no notion of surrounding spaces.
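
To illustrate the alternation issue in isolation, here is a small sketch. The names `buggy` and `fixed` are hypothetical, and `fixed` only shows the regrouping (closing dash inside the same branch), not a complete dialogue matcher.

```python
import re

word = "Viu-se que eles não estavam lá"

# Alternation binds loosest, so this group reads as: (^\s?-\s.+\s) OR (-)
buggy = re.compile(r'(^\s?-\s.+\s|-)', re.MULTILINE)
# Illustrative regrouping: the closing dash now belongs to the single branch
fixed = re.compile(r'(^\s?-\s.+\s-)', re.MULTILINE)

buggy_matches = [m.group() for m in buggy.finditer(word)]
fixed_matches = [m.group() for m in fixed.finditer(word)]

print(buggy_matches)  # ['-']
print(fixed_matches)  # []
```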

You'll probably also need to remove the ^ in the second group, because otherwise a dialogue whose opening dash appears in the middle of a line won't be caught.

Examples:

import re

text = '''
"Para muitos é mais do que isso."

Eles chegarem em casa são e salvos

Viu-se que eles não estavam lá

hello - More text and example - and stuff

a confusing-example-with-hyphens

Here is something else

- Start with dashes -, "quote me here"
'''

rgx = r'(".+")|(\s?-\s.+\s-)'

for d in re.finditer(rgx, text, re.MULTILINE):
    print(d.group())

Gets you:

"Para muitos é mais do que isso."
 - More text and example -

- Start with dashes -
"quote me here"

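N.B.: You could also try to control the number of spaces you accept after a dash, for example with rgx = r'(".+")|(\s?-\s{1}.+\s{1}-)'. Be aware that \s{1} matches exactly what a bare \s matches, and the following .+ can still consume extra spaces, so on its own this does not reject multiple spaces.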
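
A minimal sketch of how the quantifier behaves in this pattern: `loose` is the dash branch from the note above, while `strict` is a hypothetical variant (not from the answer) that requires a non-whitespace character right after the single space. The sample strings are made up for illustration.

```python
import re

one_space = "hello - More text - and stuff"
two_spaces = "hello -  More text - and stuff"

loose = r'\s?-\s{1}.+\s{1}-'   # \s{1} behaves like a bare \s
strict = r'\s?-\s\S.*\s-'      # assumption: \S forbids a second space after the dash

loose_one = re.search(loose, one_space)
loose_two = re.search(loose, two_spaces)
strict_one = re.search(strict, one_space)
strict_two = re.search(strict, two_spaces)

print(repr(loose_one.group()))   # ' - More text -'
print(repr(loose_two.group()))   # ' -  More text -' (extra space still accepted)
print(repr(strict_one.group()))  # ' - More text -'
print(strict_two)                # None
```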