I want to match this text:
<SERIES>
<OWNER-CIK>0000003521
<SERIES-ID>S000020958
<SERIES-NAME>Alger Small Cap Focus Fund
<CLASS-CONTRACT>
<CLASS-CONTRACT-ID>C000059340
<CLASS-CONTRACT-NAME>Alger Small Cap Focus Fund Class I
<CLASS-CONTRACT-TICKER-SYMBOL>AOFIX
</CLASS-CONTRACT>
<CLASS-CONTRACT>
<CLASS-CONTRACT-ID>C000095961
<CLASS-CONTRACT-NAME>Alger Small Cap Focus Fund Class Z
<CLASS-CONTRACT-TICKER-SYMBOL>AGOZX
</CLASS-CONTRACT>
<CLASS-CONTRACT>
<CLASS-CONTRACT-ID>C000179520
<CLASS-CONTRACT-NAME>Class Y
<CLASS-CONTRACT-TICKER-SYMBOL>AOFYX
</CLASS-CONTRACT>
</SERIES>
<SERIES>
From:
<SERIES>
Until:
</SERIES>
I'm trying with:
<SERIES>[^/]
but it fails at the line with:
</CLASS-CONTRACT>
If I add S to the character class, the match ends even earlier, because it then stops at the first / or S, whichever comes first. I need it to stop only where both appear together, in that specific order: /S
CodePudding user response:
Just use .*? between the opening and closing tags. You'll need re.S so that . also matches newlines. The ? makes the match as short as possible, in case the closing tag appears multiple times.
So the full string would be
r"<SERIES>.*?</SERIES>"
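As a minimal sketch of how that pattern could be applied, assuming the input is already in a string (the variable names and the shortened two-block sample, including the second series ID, are made up for illustration):

```python
import re

# Hypothetical sample input: two <SERIES> blocks, shortened from the question.
# The second SERIES-ID is invented for this example.
text = """<SERIES>
<SERIES-ID>S000020958
<CLASS-CONTRACT>
<CLASS-CONTRACT-ID>C000059340
</CLASS-CONTRACT>
</SERIES>
<SERIES>
<SERIES-ID>S000000001
</SERIES>
"""

# re.S (DOTALL) lets "." match newlines; the lazy ".*?" stops at the
# first </SERIES> instead of running on to the last one.
blocks = re.findall(r"<SERIES>.*?</SERIES>", text, flags=re.S)

print(len(blocks))
print(blocks[0])
```

Without the ? the match would be greedy and would run from the first <SERIES> to the last </SERIES>, returning everything as one block.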
CodePudding user response:
This should work. It uses a lookahead so it knows when to stop.
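import re
# `text` holds the input shown in the question.
# The lazy .*? stops at the first newline that is followed by the next <SERIES>.
pattern = re.compile(r'<SERIES>.*?(?=\n<SERIES>)', re.S)
print(pattern.findall(text)[0])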
Output:
<SERIES>
<OWNER-CIK>0000003521
<SERIES-ID>S000020958
<SERIES-NAME>Alger Small Cap Focus Fund
<CLASS-CONTRACT>
<CLASS-CONTRACT-ID>C000059340
<CLASS-CONTRACT-NAME>Alger Small Cap Focus Fund Class I
<CLASS-CONTRACT-TICKER-SYMBOL>AOFIX
</CLASS-CONTRACT>
<CLASS-CONTRACT>
<CLASS-CONTRACT-ID>C000095961
<CLASS-CONTRACT-NAME>Alger Small Cap Focus Fund Class Z
<CLASS-CONTRACT-TICKER-SYMBOL>AGOZX
</CLASS-CONTRACT>
<CLASS-CONTRACT>
<CLASS-CONTRACT-ID>C000179520
<CLASS-CONTRACT-NAME>Class Y
<CLASS-CONTRACT-TICKER-SYMBOL>AOFYX
</CLASS-CONTRACT>
</SERIES>
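One caveat with a lookahead for the next <SERIES> line: the last block in the input has nothing after it, so it would not be matched. A possible variation, sketched here on a made-up two-block input, is to let the lookahead also accept the end of the string:

```python
import re

# Hypothetical input: the last block has no <SERIES> line after it.
text = "<SERIES>\nA\n</SERIES>\n<SERIES>\nB\n</SERIES>\n"

# Stop before the next "\n<SERIES>" or, failing that, at the end of the input.
pattern = re.compile(r"<SERIES>.*?(?=\n<SERIES>|\Z)", re.S)

# strip() drops the trailing newline that the final block picks up.
blocks = [b.strip() for b in pattern.findall(text)]
print(blocks)
```

This is only a sketch. If every block is reliably closed with </SERIES>, matching up to the closing tag directly is simpler.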