I have encountered a behavior with re.match that I do not understand.
I am contributing to an open-source package that recognizes legal citations. Many citations indicate which section number they are citing to, but there are different ways of representing that something is a section (§, sec., section, sections, §§, S., etc.). I wrote a regex that matches all of these possible representations.
import re
short_examples = {"§", "§§", "section", "Sec.", "secs.", "Sections"}
short_regex = re.compile(r"(§§?)|([Ss]((ec)(tion)?)?s?\.?)")
for example in short_examples:
    print(example, re.match(short_regex, example))
I then tried to expand the examples to include full citations.
full_examples = {
    "1 U.S.C. § 1",
    "1 U.S.C. §§ 1",
    "1 U.S.C. section 1",
    "1 U.S.C. Sec. 1",
}
full_regex = re.compile(r"1 U.S.C. (§§?)|([Ss]((ec)(tion)?)?s?\.?) 1")
for example in full_examples:
    print(example, re.match(full_regex, example))
This time, the first two examples matched, but the third and fourth did not, and I cannot figure out why. The only difference between the strings that match and the strings that do not is the representation of the section, and all four representations of the section match in isolation (i.e. as short examples). I thought maybe the issue was that, with the "section" and "Sec." examples, re.match was matching the "S" (itself a valid representation) and treating the rest of the substring as superfluous. But that would be non-greedy, and the same issue does not occur when there are two section symbols, even though one alone is a valid representation. I cannot think of what else the issue could be.
Can somebody please help me understand what I am doing wrong?
CodePudding user response:
You need to put a group around the alternatives.
1 U\.S\.C\. (?:(§§?)|([Ss]((ec)(tion)?)?s?\.?)) 1
Otherwise "1 U.S.C. " is only part of the first alternative, the one that matches §, and the section number at the end is only part of the second alternative, the one that matches the word Section.
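To see the difference concretely, here is a minimal sketch that runs both patterns over the examples from the question. The helper name matched_text is made up for this illustration, and the ungrouped pattern has its dots escaped so that the grouping is the only thing that differs between the two.

```python
import re

full_examples = [
    "1 U.S.C. § 1",
    "1 U.S.C. §§ 1",
    "1 U.S.C. section 1",
    "1 U.S.C. Sec. 1",
]

# Ungrouped: "|" has the lowest precedence, so the pattern splits into
#   "1 U\.S\.C\. (§§?)"   OR   "([Ss]((ec)(tion)?)?s?\.?) 1"
ungrouped = re.compile(r"1 U\.S\.C\. (§§?)|([Ss]((ec)(tion)?)?s?\.?) 1")

# Grouped: the (?:...) wrapper limits "|" to the section markers only.
grouped = re.compile(r"1 U\.S\.C\. (?:(§§?)|([Ss]((ec)(tion)?)?s?\.?)) 1")


def matched_text(pattern, text):
    """Return the text matched at the start of `text`, or None (hypothetical helper)."""
    m = pattern.match(text)
    return m.group(0) if m else None


for example in full_examples:
    print(repr(example),
          "| ungrouped:", repr(matched_text(ungrouped, example)),
          "| grouped:", repr(matched_text(grouped, example)))

# The second alternative of the ungrouped pattern matches on its own,
# without any "1 U.S.C. " prefix:
print(repr(matched_text(ungrouped, "section 1")))
```

With the ungrouped pattern, the § examples only match up to the section symbol (the trailing " 1" is never consumed), and the "section" and "Sec." examples return None because neither alternative can start at the beginning of the string. With the grouped pattern, all four strings match in full.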