Unexpected behavior with regular expressions

Time:01-21

I am trying to write a parser that detects bibliography footnotes using regular expressions, but one particular regex is not working and I cannot figure out why. Here is the code that isolates the problem.

import re
PATTERN = "[\\w ]+, [\\w ]+, (\\d+(\\-\\d+)?)\\."

match_A = re.search(PATTERN, "Author, Some Book, 51–66.")
match_B = re.search(PATTERN, "Author, Some Book, 60-61.")

print(match_A != None)
print(match_B != None)

SUB_PATTERN = "\\d+(\\-\\d+)?"

match_C = re.search(SUB_PATTERN, "51–66")
match_D = re.search(SUB_PATTERN, "60–61")

print(match_C != None)
print(match_D != None)

The result is:

False
True
True
True

But I expect to obtain all True. Can anybody reproduce this issue, or explain what is happening to me?

I am working on Windows 10. My Python version:

Python 3.11.1 (tags/v3.11.1:a7a450f, Dec  6 2022, 19:58:39) [MSC v.1934 64 bit (AMD64)] on win32

CodePudding user response:

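Your dashes are different: the first one is a "–" (en dash, U+2013) and the second one is a "-" (hyphen-minus, U+002D). If you don't believe me, google each one. The sub-pattern still matches "51–66" because the dash group is optional, so the digits "51" alone are enough. The full pattern fails because, after "51", it requires a literal "." but finds the en dash instead. You can put both dashes into a character class: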

PATTERN = "[\\w ]+, [\\w ]+, (\\d+([–-]\\d+)?)\\."
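To check this yourself, here is a minimal sketch that runs the character-class pattern against both strings from the question and uses the standard `unicodedata` module to name the dash in each one. The helper `dash_names` and the `samples` list are hypothetical names introduced only for this illustration, and the pattern is written as a raw string for readability.

```python
import re
import unicodedata

# Character-class version of the pattern: [–-] accepts an en dash or a hyphen.
# Written as a raw string, so single backslashes are enough.
PATTERN = r"[\w ]+, [\w ]+, (\d+([–-]\d+)?)\."

# The two test strings from the question (en dash first, hyphen second).
samples = ["Author, Some Book, 51–66.", "Author, Some Book, 60-61."]

for text in samples:
    match = re.search(PATTERN, text)
    # group(1) holds the page number or page range that was captured.
    print(match is not None, match.group(1) if match else None)


def dash_names(text):
    """Return the Unicode names of the dash characters found in text (hypothetical helper)."""
    return [unicodedata.name(ch) for ch in text if ch in "–-"]


for text in samples:
    print(dash_names(text))
```

Both searches should now succeed, and the names printed by `dash_names` make the difference between the two visually similar characters explicit. If other dash variants can appear in your footnotes (an em dash, for instance), they would need to be added to the class as well.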
