Regex extraction of beginning and end of substrings

I want to achieve the following: from a string like "dsff § 15 Absatz 1 Kreislaufwirtschaftsgesetz dsfsf § 17 dssfsdf § 18 Absatz 4 und 5 Kreislaufwirtschaftsgesetz" (the string might vary), I want to extract [('§ 15', 'Kreislaufwirtschaftsgesetz'), ('§ 18', 'Kreislaufwirtschaftsgesetz')].

I only got so far:

parts = re.findall(r"(§+\s\d+[a-z]{0,1})(.*?gesetz)", text)

But the result is [('§ 15', ' Absatz 1 Kreislaufwirtschaftsgesetz'), ('§ 17', ' dssfsdf § 18 Absatz 4 und 5 Kreislaufwirtschaftsgesetz')]

CodePudding user response：

Instead of a lazy match, you can use [^§], assuming this symbol only appears in the section numbers.

parts = re.findall(r"(§+\s\d+[a-z]{0,1})[^§]*?(\w*gesetz)", text)
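
As a self-contained check, here is a minimal sketch that runs this pattern on the sample string from the question (the variable names are only illustrative):

```python
import re

text = ("dsff § 15 Absatz 1 Kreislaufwirtschaftsgesetz dsfsf § 17 dssfsdf "
        "§ 18 Absatz 4 und 5 Kreislaufwirtschaftsgesetz")

# [^§]*? cannot run past the next "§", so "§ 17" (which has no law name
# before the following section sign) produces no match at all.
pattern = re.compile(r"(§+\s\d+[a-z]{0,1})[^§]*?(\w*gesetz)")
parts = pattern.findall(text)

print(parts)
# [('§ 15', 'Kreislaufwirtschaftsgesetz'), ('§ 18', 'Kreislaufwirtschaftsgesetz')]
```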


CodePudding user response：

You can use

import re
text = "dsff § 15 Absatz 1 Kreislaufwirtschaftsgesetz dsfsf § 17 dssfsdf § 18 Absatz 4 und 5 Kreislaufwirtschaftsgesetz"
print(re.findall(r'(§\s*\d+)[^§]*?(\S*gesetz)\b', text))
# => [('§ 15', 'Kreislaufwirtschaftsgesetz'), ('§ 18', 'Kreislaufwirtschaftsgesetz')]

Also, if punctuation may be glued to the left of the second group's value and you only want to match word characters, replace \S with \w (or with [^\W\d_] to match letters only). Details:

(§\s*\d+) - Group 1: §, zero or more whitespaces, one or more digits
[^§]*? - any zero or more chars other than §, as few as possible
(\S*gesetz) - Group 2: zero or more non-whitespace chars followed by the literal text gesetz
\b - word boundary.
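
To illustrate the difference between \S, \w and [^\W\d_] in the second group, here is a small sketch. The sample string is hypothetical (it is not from the question) and wraps the first law name in parentheses:

```python
import re

# Hypothetical sample: the first law name has punctuation glued to it.
text = "vgl. § 3 (Abfallgesetz) sowie § 7 Absatz 2 Wassergesetz"

# \S* also picks up the opening parenthesis.
with_nonspace = re.findall(r"(§\s*\d+)[^§]*?(\S*gesetz)\b", text)
# \w* stops at the parenthesis (word characters only).
with_word = re.findall(r"(§\s*\d+)[^§]*?(\w*gesetz)\b", text)
# [^\W\d_]* allows letters only (no digits, no underscore).
with_letters = re.findall(r"(§\s*\d+)[^§]*?([^\W\d_]*gesetz)\b", text)

print(with_nonspace)  # [('§ 3', '(Abfallgesetz'), ('§ 7', 'Wassergesetz')]
print(with_word)      # [('§ 3', 'Abfallgesetz'), ('§ 7', 'Wassergesetz')]
print(with_letters)   # [('§ 3', 'Abfallgesetz'), ('§ 7', 'Wassergesetz')]
```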

CodePudding user response：

One possible solution is to insert the following between the two capturing groups:

(?:\s[A-Z]+\s\d+)+ - a non-capturing group that matches "Absatz <number>" or "und <number>", repeated one or more times,
\s - to match the whitespace after them.

Because [A-Z] must also match lower-case letters (as in "Absatz" and "und"), the re.IGNORECASE flag should be passed as well.

So the whole code can be:

re.findall(r"(§+\s\d+[a-z]?)(?:\s[A-Z]+\s\d+)+\s(.*?gesetz)", text, flags=re.IGNORECASE)
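
A runnable sketch of this approach on the sample string from the question follows. The second input is a hypothetical string added only to show one consequence of the pattern: at least one "word number" pair is required between the section and the law name, so a section followed directly by the law name is not matched.

```python
import re

pattern = re.compile(
    r"(§+\s\d+[a-z]?)(?:\s[A-Z]+\s\d+)+\s(.*?gesetz)",
    flags=re.IGNORECASE,
)

text = ("dsff § 15 Absatz 1 Kreislaufwirtschaftsgesetz dsfsf § 17 dssfsdf "
        "§ 18 Absatz 4 und 5 Kreislaufwirtschaftsgesetz")
parts = pattern.findall(text)
print(parts)
# [('§ 15', 'Kreislaufwirtschaftsgesetz'), ('§ 18', 'Kreislaufwirtschaftsgesetz')]

# Hypothetical input: no "Absatz <number>" between the section and the law,
# so the mandatory (?:...)+ group cannot match.
no_absatz = pattern.findall("§ 15 Kreislaufwirtschaftsgesetz")
print(no_absatz)
# []
```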