Python regex fullmatch doesn't work as expected

Time:11-23

I have a text file that contains some sentences. I'm checking whether each one is a valid sentence according to some rules and writing "valid" or "not valid" to a separate text file. My main problem is that when I press Ctrl+F and enter my regex in the search bar, it matches the strings I want to match, but in code it doesn't work. Here is my code:

import re

pattern = re.compile(r'(([A-Z])[a-z\s,]*)((: ["‘][a-z,!?\.\s]*["’][.,!?])|(; [a-zA-Z\s]*[!.?])|(\s["‘][a-z,.;!?\s]*["’])|([\.?!]))')
text=open('validSentences',"w")
with open('sentences.txt',encoding='utf8') as file:
    lines = file.readlines()
    for line in lines:
        matches = pattern.fullmatch(line)
        if(matches==None):
            text.write("not valid" "\n")
        else:
            text.write("valid" "\n") 
    file.close()

The documentation says that fullmatch matches only when the whole string matches, and that's what I'm trying to do, but this code writes "not valid" for every sentence I have.
The text file that I have:

How can you say that to me? 
As he looked at his reflection in the mirror, he took a deep breath. 
He nodded at himself and, feeling braver, he stepped outside the bathroom. He bumped straight into the 
extremely tall man, who was waiting by the door. 
David said ‘Oh, sorry!’. 
The happy pair discussed their future life 2gether and shared sweet words of admiration. 
We will not stop you; I promise! 
Come here ASAP! 
He pushed his chair back and went to the kitchen at 2 pM. 
I do not know... 
The main character in the movie said: "Play hard. Work harder." 

When I enter my regex in VS Code's Ctrl+F search, the whole first, second, fourth, seventh and eighth lines are highlighted, so according to the fullmatch() function they should be written as "valid", but they aren't. I need help with this issue.

CodePudding user response:

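First, you can drop lines = file.readlines() and iterate over the file object directly; readlines() reads the whole file and leaves the file handle at the end of the stream, so the two approaches should not be mixed. Then, keep in mind that whether you use for line in lines: or for line in file:, the line variable keeps its trailing newline, and fullmatch only succeeds if the pattern consumes the entire string, newline included. So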

  • Either use line=line.rstrip() to remove the trailing whitespace before running the regex or
  • Ensure your pattern ends in \n? (an optional newline), or even \s* (any zero or more whitespace).
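To make the effect of the trailing newline concrete, here is a minimal sketch that reuses the pattern from the question (written as a raw string). The sample line is hard-coded rather than read from sentences.txt, and the variable names with_newline and stripped are only illustrative.

```python
import re

# Pattern from the question, unchanged apart from the raw-string prefix.
pattern = re.compile(r'(([A-Z])[a-z\s,]*)((: ["‘][a-z,!?\.\s]*["’][.,!?])|(; [a-zA-Z\s]*[!.?])|(\s["‘][a-z,.;!?\s]*["’])|([\.?!]))')

# A line as it arrives when iterating over a file: the newline is still attached.
line = "How can you say that to me?\n"

with_newline = pattern.fullmatch(line)       # the "\n" is left unmatched
stripped = pattern.fullmatch(line.rstrip())  # newline removed first

print(with_newline is None)   # True
print(stripped is not None)   # True
```

An editor's search box most likely highlights the sentence because it looks for the pattern inside each line and does not require the line terminator to be matched, whereas fullmatch requires the whole string to be consumed.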

So, a possible solution looks like

with open('sentences.txt',encoding='utf8') as file:
    for line in file:
        matches = pattern.fullmatch(line.rstrip('\n'))
...

Or,

pattern = re.compile(r'([A-Z][a-z\s,]*)(?:: ["‘][a-z,!?\.\s]*["’][.,!?]|; [a-zA-Z\s]*[!.?]|\s["‘][a-z,.;!?\s]*["’]|[.?!])\s*')
#...
with open('sentences.txt',encoding='utf8') as file:
    for line in file:
....
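For illustration, the second variant (the pattern ending in \s*) could be exercised as in the sketch below. The classify helper and the in-memory sample list are assumptions made for the example; they stand in for the loop over sentences.txt, which is not reproduced here.

```python
import re

# Pattern from the answer: the trailing \s* absorbs the newline at the end of each line.
pattern = re.compile(r'([A-Z][a-z\s,]*)(?:: ["‘][a-z,!?\.\s]*["’][.,!?]|; [a-zA-Z\s]*[!.?]|\s["‘][a-z,.;!?\s]*["’]|[.?!])\s*')


def classify(lines):
    """Hypothetical helper: label each line 'valid' or 'not valid'."""
    return ["valid" if pattern.fullmatch(line) else "not valid" for line in lines]


# Three sentences from the question, with the newline a file iterator would keep.
sample = [
    "How can you say that to me?\n",
    "We will not stop you; I promise!\n",
    "Come here ASAP!\n",
]

results = classify(sample)
print(results)  # ['valid', 'valid', 'not valid']
```

With a real file, the same list comprehension can be fed the open file object directly, and the labels written out with a second with open(...) block so that the output file is closed as well.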