I am trying to write a parser that detects bibliography footnotes, using regular expressions. But a particular RE is not working, and I cannot figure out why. Here is the code where I isolated the problem.
import re
PATTERN = "[\\w ]+, [\\w ]+, (\\d+(\\-\\d+)?)\\."
match_A = re.search(PATTERN, "Author, Some Book, 51–66.")
match_B = re.search(PATTERN, "Author, Some Book, 60-61.")
print(match_A != None)
print(match_B != None)
SUB_PATTERN = "\\d+(\\-\\d+)?"
match_C = re.search(SUB_PATTERN, "51–66")
match_D = re.search(SUB_PATTERN, "60–61")
print(match_C != None)
print(match_D != None)
The result is:
False
True
True
True
But I expect all four results to be True.
Can anybody reproduce this issue, or explain what is happening to me?
I am working on Windows 10. My Python version:
Python 3.11.1 (tags/v3.11.1:a7a450f, Dec 6 2022, 19:58:39) [MSC v.1934 64 bit (AMD64)] on win32
Answer:
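Your dashes are different: the first one is a "–" (en dash, U+2013) and the second one is a "-" (hyphen-minus, U+002D). If you don't believe me, google each one. The full pattern only accepts the hyphen, so the string with the en dash fails. The sub-pattern searches still return True because the dash group is optional, so the leading digits match on their own. You can put both dashes into a character class: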
PATTERN = "[\\w ]+, [\\w ]+, (\\d+([–-]\\d+)?)\\."
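To check which dash a string actually contains, and to confirm that a character class accepts both, here is a minimal sketch. It assumes the quantifiers in the pattern are `+`, and the helper name `describe_dashes` is made up for illustration; it is not part of the question's code.

```python
import re
import unicodedata

# Assumed form of the pattern: "+" quantifiers, with both dashes in one class.
# The hyphen is placed last in the class so it is read literally, not as a range.
PATTERN = r"[\w ]+, [\w ]+, (\d+([–-]\d+)?)\."

def describe_dashes(text):
    """Return the Unicode names of all dash punctuation (category Pd) in text."""
    return [unicodedata.name(ch) for ch in text if unicodedata.category(ch) == "Pd"]

samples = ["Author, Some Book, 51–66.", "Author, Some Book, 60-61."]

for sample in samples:
    match = re.search(PATTERN, sample)
    # Show which dash each sample contains and whether the pattern matched.
    print(describe_dashes(sample), match is not None)
```

Printing the Unicode names makes the difference visible even when the two characters look the same in your editor's font.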