I have to search for any occurrence of text of the form: The XXth (?:and XXth)? session of the XX body
It can be any session, and there are several bodies. I've come up with a pattern that finds these phrases when there is only one in a sentence, but it fails when the text occurs more than once. See the example below:
import re
test = """1. The thirty-fifth session of the Subsidiary Body for Implementation (SBI) was held at the International
Convention Centre and Durban Exhibition Centre in Durban, South Africa, from 28 November to 3 December 2011. 10.
Forum on the impact of the implementation of response measures at the thirty-fourth and thirty-fifth sessions of the
subsidiary bodies, with the objective of developing a work programme under the Subsidiary Body for Scientific and
Technological Advice and the Subsidiary Body for Implementation to address these impacts, with a view to adopting,
at the seventeenth session of the Conference of the Parties, modalities for the operationalization of the work
program and a possible forum on response measures.[^6] """
pattern = re.compile(r".*(The [\w\s-]* sessions? of the (?:Subsidiary Body for Implementation|Conference of the "
r"Parties|subsidiary bodies))", re.IGNORECASE)
print(pattern.findall(test))
This prints: ['The thirty-fifth session of the Subsidiary Body for Implementation', 'the seventeenth session of the Conference of the Parties']
and I would like to get: ['The thirty-fifth session of the Subsidiary Body for Implementation', 'the thirty-fourth and thirty-fifth sessions of the subsidiary bodies', 'the seventeenth session of the Conference of the Parties']
I think the problem is that the pattern is too broad, but I'm not sure how to constrain it because the phrase can end in different ways...
Any clue how to improve this result?
CodePudding user response:
Is there a reason for the .* at the beginning of your regex? If I understand findall correctly, you don't need it.
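To illustrate the point about the leading .*, here is a minimal sketch on a made-up one-line string (the sample text and the variable names are assumptions for illustration, not taken from the question). Because .* is greedy, it swallows everything up to the last possible match on the line, so findall only reports that last one:

```python
import re

# Hypothetical sample sentence containing two matching phrases on one line.
text = "the first session of the SBI and the second session of the SBI"

# With a leading greedy .*, the engine consumes the whole line and
# backtracks only far enough to find the LAST phrase.
with_prefix = re.compile(r".*(the \w+ session of the SBI)")

# Without the prefix, findall scans left to right and reports every phrase.
without_prefix = re.compile(r"the \w+ session of the SBI")

greedy_result = with_prefix.findall(text)
plain_result = without_prefix.findall(text)

print(greedy_result)  # ['the second session of the SBI']
print(plain_result)   # ['the first session of the SBI', 'the second session of the SBI']
```

Dropping the prefix does not by itself handle phrases that are split across a line break, which is a separate issue.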
CodePudding user response:
The problem is that there can be an "and <NUMERAL>" part after a numeral. You can use
The\s+\S+(?:\s+and\s+\S+)?\s+sessions?\s+of\s+the\s+(?:Subsidiary\s+Body\s+for\s+Implementation|Conference\s+of\s+the\s+Parties|subsidiary\s+bodies)
See the regex demo.
Details:
The - a fixed string
\s+\S+ - one or more whitespaces and one or more non-whitespace chars
(?:\s+and\s+\S+)? - an optional sequence of "and" enclosed with one or more whitespace chars and then one or more non-whitespace chars
\s+ - one or more whitespaces
sessions? - session or sessions
\s+of\s+the - one or more whitespaces, of, one or more whitespaces, the
\s+ - one or more whitespaces
(?: - start of a non-capturing group:
Subsidiary\s+Body\s+for\s+Implementation - Subsidiary, one or more whitespaces, Body, one or more whitespaces, for, one or more whitespaces, Implementation
| - or
Conference\s+of\s+the\s+Parties - Conference, one or more whitespaces, of, one or more whitespaces, the, one or more whitespaces, Parties
| - or
subsidiary\s+bodies - subsidiary, one or more whitespaces, bodies
) - end of the group.
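As a rough sketch of how this pattern could be used from Python, the snippet below applies it with re.IGNORECASE to a shortened version of the text from the question. The shortened sample and the whitespace-normalization step are assumptions added for illustration. Since \s+ also matches line breaks, a phrase that wraps onto the next line is still found, and the join/split call collapses that line break into a single space:

```python
import re

# Shortened sample based on the question's text; note the line break
# between "of the" and "subsidiary bodies".
text = (
    "1. The thirty-fifth session of the Subsidiary Body for Implementation (SBI) was held in Durban. "
    "10. Forum on the impact of the implementation of response measures at the thirty-fourth and "
    "thirty-fifth sessions of the\nsubsidiary bodies, with a view to adopting, at the seventeenth "
    "session of the Conference of the Parties, modalities for the work programme."
)

pattern = re.compile(
    r"The\s+\S+(?:\s+and\s+\S+)?\s+sessions?\s+of\s+the\s+"
    r"(?:Subsidiary\s+Body\s+for\s+Implementation"
    r"|Conference\s+of\s+the\s+Parties"
    r"|subsidiary\s+bodies)",
    re.IGNORECASE,
)

# There is no capturing group, so findall returns the full matched text.
# " ".join(m.split()) collapses any run of whitespace (including "\n") to one space.
matches = [" ".join(m.split()) for m in pattern.findall(text)]

for m in matches:
    print(m)
# The thirty-fifth session of the Subsidiary Body for Implementation
# the thirty-fourth and thirty-fifth sessions of the subsidiary bodies
# the seventeenth session of the Conference of the Parties
```

Because \S+ matches exactly one whitespace-free token, this sketch assumes each numeral is written as a single word such as thirty-fifth; a numeral spelled with spaces would need a wider sub-pattern.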