I want to achieve the following:
From a string like "dsff § 15 Absatz 1 Kreislaufwirtschaftsgesetz dsfsf § 17 dssfsdf § 18 Absatz 4 und 5 Kreislaufwirtschaftsgesetz"
(the string may vary), I want to extract [('§ 15', 'Kreislaufwirtschaftsgesetz'), ('§ 18', 'Kreislaufwirtschaftsgesetz')].
This is as far as I got:
parts = re.findall(r"(§+\s\d+[a-z]{0,1})(.*?gesetz)", text)
But the result is [('§ 15', ' Absatz 1 Kreislaufwirtschaftsgesetz'), ('§ 17', ' dssfsdf § 18 Absatz 4 und 5 Kreislaufwirtschaftsgesetz')]
CodePudding user response:
Instead of a lazy match on any character, you can use the negated character class [^§], assuming this symbol only appears in front of the section numbers.
parts = re.findall(r"(§+\s\d+[a-z]{0,1})[^§]*?(\w*gesetz)", text)
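To make the difference concrete, here is a small side-by-side sketch of the pattern from the question and the negated-class variant, run on the sample string from the question. The variable names lazy and negated are only illustrative.

```python
import re

text = ("dsff § 15 Absatz 1 Kreislaufwirtschaftsgesetz dsfsf § 17 dssfsdf "
        "§ 18 Absatz 4 und 5 Kreislaufwirtschaftsgesetz")

# Lazy dot: .*? is allowed to run across the next "§",
# so the match starting at "§ 17" swallows "§ 18" as well.
lazy = re.findall(r"(§+\s\d+[a-z]{0,1})(.*?gesetz)", text)

# Negated class: [^§]*? cannot cross another "§",
# so "§ 17" finds no law name before "§ 18" and is skipped.
negated = re.findall(r"(§+\s\d+[a-z]{0,1})[^§]*?(\w*gesetz)", text)

print(lazy)
# [('§ 15', ' Absatz 1 Kreislaufwirtschaftsgesetz'), ('§ 17', ' dssfsdf § 18 Absatz 4 und 5 Kreislaufwirtschaftsgesetz')]
print(negated)
# [('§ 15', 'Kreislaufwirtschaftsgesetz'), ('§ 18', 'Kreislaufwirtschaftsgesetz')]
```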
CodePudding user response:
You can use
import re
text = "dsff § 15 Absatz 1 Kreislaufwirtschaftsgesetz dsfsf § 17 dssfsdf § 18 Absatz 4 und 5 Kreislaufwirtschaftsgesetz"
print(re.findall(r'(§\s*\d+)[^§]*?(\S*gesetz)\b', text))
# => [('§ 15', 'Kreislaufwirtschaftsgesetz'), ('§ 18', 'Kreislaufwirtschaftsgesetz')]
Also, if punctuation can be glued to the left of the second group value and you only want to match word characters, replace \S with \w (or with [^\W\d_] to match letters only).
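As a quick illustration of that difference, the sketch below runs the three variants on a made-up input in which the law name is wrapped in parentheses. The sample string and the variable names are assumptions for demonstration only, not part of the original question.

```python
import re

# Hypothetical input: the law name has a "(" glued to its left.
sample = "siehe § 15 Absatz 1 (Kreislaufwirtschaftsgesetz) und weiter"

# \S* also picks up the opening parenthesis.
with_nonspace = re.findall(r"(§\s*\d+)[^§]*?(\S*gesetz)\b", sample)

# \w* keeps only word characters (letters, digits, underscore).
with_word = re.findall(r"(§\s*\d+)[^§]*?(\w*gesetz)\b", sample)

# [^\W\d_]* keeps letters only.
with_letters = re.findall(r"(§\s*\d+)[^§]*?([^\W\d_]*gesetz)\b", sample)

print(with_nonspace)  # [('§ 15', '(Kreislaufwirtschaftsgesetz')]
print(with_word)      # [('§ 15', 'Kreislaufwirtschaftsgesetz')]
print(with_letters)   # [('§ 15', 'Kreislaufwirtschaftsgesetz')]
```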
Details:
(§\s*\d+) - Group 1: §, zero or more whitespace characters, one or more digits
[^§]*? - zero or more characters other than §, as few as possible
(\S*gesetz) - Group 2: zero or more non-whitespace characters, followed by gesetz
\b - a word boundary.
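The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows comments. This is only a sketch of one way to write it; the name SECTION_AND_LAW is made up for the example.

```python
import re

# SECTION_AND_LAW is a hypothetical name chosen for this sketch.
SECTION_AND_LAW = re.compile(
    r"""
    (§\s*\d+)     # group 1: section sign, optional whitespace, one or more digits
    [^§]*?        # as few characters as possible, never crossing another section sign
    (\S*gesetz)   # group 2: non-whitespace characters ending in "gesetz"
    \b            # word boundary right after "gesetz"
    """,
    re.VERBOSE,
)

text = ("dsff § 15 Absatz 1 Kreislaufwirtschaftsgesetz dsfsf § 17 dssfsdf "
        "§ 18 Absatz 4 und 5 Kreislaufwirtschaftsgesetz")

result = SECTION_AND_LAW.findall(text)
print(result)
# [('§ 15', 'Kreislaufwirtschaftsgesetz'), ('§ 18', 'Kreislaufwirtschaftsgesetz')]
```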
CodePudding user response:
One possible solution is to insert the following between the two capturing groups:
(?:\s[A-Z]+\s\d+)+ - a non-capturing group that matches "Absatz <number>" or "und <number>", repeated one or more times,
\s - a whitespace character after them.
To match both upper- and lower-case letters, the re.IGNORECASE flag should also be passed.
So the whole code can be:
re.findall(r"(§+\s\d+[a-z]?)(?:\s[A-Z]+\s\d+)+\s(.*?gesetz)", text, flags=re.IGNORECASE)
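A runnable version of this, using the sample string from the question, could look like the sketch below. The second input, bare, is a made-up string added to show one consequence of this approach: because the inserted group must match at least once, a section that is followed directly by the law name is not found.

```python
import re

text = ("dsff § 15 Absatz 1 Kreislaufwirtschaftsgesetz dsfsf § 17 dssfsdf "
        "§ 18 Absatz 4 und 5 Kreislaufwirtschaftsgesetz")

pattern = r"(§+\s\d+[a-z]?)(?:\s[A-Z]+\s\d+)+\s(.*?gesetz)"

found = re.findall(pattern, text, flags=re.IGNORECASE)
print(found)
# [('§ 15', 'Kreislaufwirtschaftsgesetz'), ('§ 18', 'Kreislaufwirtschaftsgesetz')]

# Hypothetical input without any "Absatz <number>" part:
# the "word + number" group cannot match, so nothing is returned.
bare = "§ 20 Kreislaufwirtschaftsgesetz"
found_bare = re.findall(pattern, bare, flags=re.IGNORECASE)
print(found_bare)
# []
```

If sections without an Absatz part should also be extracted, the [^§] based patterns from the other answers handle that case.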