With some help, I finally managed to apply a regex to a text to look for some patterns.
My project consists of finding dialogues in a text written in Portuguese. In Portuguese, dialogue can be marked in several ways: between dashes (- ele disse que sim-), with a single dash opening the dialogue (- ele disse que sim), or between quotation marks ("eu acho que sim").
However, since Portuguese words can also contain dashes, as in "viu-me" or "disse-lhe", I had to write code that takes all of this into account.
The problem is that I also get lone dashes in the results when I search a text for the pattern.
Here is my code:
import re

text = '''
"Para muitos é mais do que isso."
Eles chegarem em casa são e salvos
Viu-se que eles não estavam lá
'''
for d in re.finditer(r'(".*")|(^\s?-\s.*\s|-)', text, re.MULTILINE):
    print(d.group())
Here is the current output:
"Para muitos é mais do que isso."
-
The code does find the dialogue in quotation marks, which is great, but it prints a dash as well. It is as if it recognized that "Viu-se" is not a dialogue, just a word with an embedded dash, yet still returned the dash from it.
The desired output:
"Para muitos é mais do que isso."
CodePudding user response:
Just put a $ sign at the end of your regex, so that the lone dash only matches at the end of a line.
r'(".*")|(^\s?-\s.*\s|-$)'
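A quick way to check what the anchored alternative does, as a minimal sketch. It assumes `.*` for the wildcard parts of the pattern, and the second sample string is made up for illustration; it is not from the question.

```python
import re

# Pattern from this answer; `.*` is an assumption for the wildcard parts.
pattern = r'(".*")|(^\s?-\s.*\s|-$)'

question_text = '''
"Para muitos é mais do que isso."
Eles chegarem em casa são e salvos
Viu-se que eles não estavam lá
'''

# Hypothetical extra sample: a line that happens to end in a dash.
trailing_dash_text = 'Ele parou-\n'


def find_all(text):
    """Return every full match of the pattern in `text`."""
    return [m.group() for m in re.finditer(pattern, text, re.MULTILINE)]


print(find_all(question_text))       # the hyphen inside "Viu-se" is no longer matched
print(find_all(trailing_dash_text))  # a dash at the very end of a line still is
```

So the `$` fixes the sample from the question, but a lone dash can still show up whenever a line ends with one.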
CodePudding user response:
The reason is that in (^\s?-\s.*\s|-) the trailing |- is incorrect. It tells the regex to match either ^\s?-\s.*\s OR a single dash/hyphen, which ends up matching the hyphen in Viu-se, since the |- alternative has no notion of surrounding spaces.
You'll probably also need to remove the ^ in the second group, because otherwise you won't catch dialogue whose dashes appear in the middle of a line.
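Both points can be checked directly. The following is a small sketch, assuming `.*` for the wildcard parts of the question's pattern; the names `alternation_hits`, `with_caret` and `without_caret` are only illustrative.

```python
import re

# The question's pattern, with `.*` assumed for the wildcard parts.
pattern = r'(".*")|(^\s?-\s.*\s|-)'

# m.lastindex is the number of the capturing group that took part in the match.
alternation_hits = [
    (m.group(), m.lastindex)
    for m in re.finditer(pattern, 'Viu-se que eles não estavam lá', re.MULTILINE)
]
print(alternation_hits)  # the lone dash comes from group 2, via its `|-` branch

# Effect of the ^ anchor on a dialogue that starts mid-line.
line = 'hello - More text and example - and stuff'
with_caret = re.findall(r'(^\s?-\s.*\s-)', line, re.MULTILINE)
without_caret = re.findall(r'(\s?-\s.*\s-)', line, re.MULTILINE)
print(with_caret)     # nothing found: the dash is not at the start of the line
print(without_caret)  # the dialogue between the two dashes is found
```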
Examples:
import re
text = '''
"Para muitos é mais do que isso."
Eles chegarem em casa são e salvos
Viu-se que eles não estavam lá
hello - More text and example - and stuff
a confusing-example-with-hyphens
Here is something else
- Start with dashes -, "quote me here"
'''
rgx = r'(".*")|(\s?-\s.*\s-)'
for d in re.finditer(rgx, text, re.MULTILINE):
    print(d.group())
Gets you (with the leading whitespace of each match trimmed):
"Para muitos é mais do que isso."
- More text and example -
- Start with dashes -
"quote me here"
N.B.: You could also spell out the exact number of spaces you expect next to each dash, in case you don't want to match on multiple spaces after a dash:
rgx = r'(".*")|(\s?-\s{1}.*\s{1}-)'
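One caveat worth checking: `\s{1}` matches exactly one whitespace character, which is what a bare `\s` already does, and the neighbouring `.*` can still absorb any further spaces. A possible way to really reject extra spaces, offered only as a sketch and not as this answer's method, is to require a non-space character right after the opening space and right before the closing one. The `strict` pattern and both sample strings below are illustrative assumptions.

```python
import re

# Same behaviour as r'-\s.*\s-': the `.*` can still swallow extra spaces.
loose = r'-\s{1}.*\s{1}-'

# Hypothetical stricter variant: exactly one space just inside each dash.
strict = r'-\s\S(?:.*\S)?\s-'

one_space = '- ele disse que sim -'
two_spaces = '-  ele disse que sim  -'

loose_on_two = bool(re.search(loose, two_spaces))
strict_on_one = bool(re.search(strict, one_space))
strict_on_two = bool(re.search(strict, two_spaces))

print(loose_on_two)   # True: the quantifier alone does not reject double spaces
print(strict_on_one)  # True
print(strict_on_two)  # False
```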