Multiple repetitions of a regex pattern

Time:04-09

I have to search for any occurrence of "The XXth (?:and XXth)? session of the XX body". It can be any session, and there are several bodies. I've come up with a pattern that finds these phrases when only one occurs in a sentence, but it fails when the text is repeated more than once. See the example below:

import re
test = """1. The thirty-fifth session of the Subsidiary Body for Implementation (SBI) was held at the International 
Convention Centre and Durban Exhibition Centre in Durban, South Africa, from 28 November to 3 December 2011. 10. 
Forum on the impact of the implementation of response measures at the thirty-fourth and thirty-fifth sessions of the 
subsidiary bodies, with the objective of developing a work programme under the Subsidiary Body for Scientific and 
Technological Advice and the Subsidiary Body for Implementation to address these impacts, with a view to adopting, 
at the seventeenth session of the Conference of the Parties, modalities for the operationalization of the work 
program and a possible forum on response measures.[^6] """
pattern = re.compile(r".*(The [\w\s-]* sessions? of the (?:Subsidiary Body for Implementation|Conference of the "
                     r"Parties|subsidiary bodies))", re.IGNORECASE) 

print(pattern.findall(test))

This prints: ['The thirty-fifth session of the Subsidiary Body for Implementation', 'the seventeenth session of the Conference of the Parties'] and I would like to get: ['The thirty-fifth session of the Subsidiary Body for Implementation', 'the thirty-fourth and thirty-fifth sessions of the subsidiary bodies', 'the seventeenth session of the Conference of the Parties']

I think the problem is that the pattern is too broad, but I'm not sure how to constrain it because it can end in different ways...

Any clue on how to improve this result?

CodePudding user response:

Is there a reason for the .* at the beginning of your regex?

If I understand findall correctly, you don't need that
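To illustrate the point, here is a minimal sketch on a made-up single-line string (the names line, body, with_prefix and without_prefix are only for this example, not from the question). re.findall already scans the whole string for non-overlapping matches, so a leading greedy .* is not needed; on a single line it swallows everything up to the last occurrence, so the earlier ones are lost.

```python
import re

# Hypothetical one-line sample with two occurrences of the phrase.
line = ("the first session of the Conference of the Parties and "
        "the second session of the Conference of the Parties")

# Simplified phrase pattern, used only for this illustration.
body = r"the \w+ session of the Conference of the Parties"

# With a leading greedy ".*", only the last occurrence on the line is captured.
with_prefix = re.findall(r".*(" + body + r")", line)

# Without it, findall reports every non-overlapping occurrence.
without_prefix = re.findall(body, line)

print(with_prefix)     # ['the second session of the Conference of the Parties']
print(without_prefix)  # both occurrences, in order
```

Dropping the prefix does not by itself handle the "and XXth" part or line breaks inside the phrase; it only removes the greedy skipping.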

CodePudding user response:

The problem is that there can be "and <NUMERAL>" after a numeral. You can use

The\s+\S+(?:\s+and\s+\S+)?\s+sessions?\s+of\s+the\s+(?:Subsidiary\s+Body\s+for\s+Implementation|Conference\s+of\s+the\s+Parties|subsidiary\s+bodies)

See the regex demo.

Details:

  • The - a fixed string
  • \s+\S+ - one or more whitespaces and one or more non-whitespace chars
  • (?:\s+and\s+\S+)? - an optional sequence of "and" enclosed with one or more whitespace chars and then one or more non-whitespace chars
  • \s+ - one or more whitespaces
  • sessions? - session or sessions
  • \s+of\s+the - one or more whitespaces, of, one or more whitespaces, the
  • \s+ - one or more whitespaces
  • (?: - start of a non-capturing group:
    • Subsidiary\s+Body\s+for\s+Implementation - Subsidiary, one or more whitespaces, Body, one or more whitespaces, for, one or more whitespaces, Implementation
    • | - or
    • Conference\s+of\s+the\s+Parties - Conference, one or more whitespaces, of, one or more whitespaces, the, one or more whitespaces, Parties
    • | - or
    • subsidiary\s+bodies - subsidiary, one or more whitespaces, bodies
  • ) - end of the group.
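
As a rough sketch of how this pattern could be used from Python, the snippet below runs it over a shortened version of the text from the question. The variable names (sample, pattern, matches) and the whitespace normalization step are assumptions added for the illustration. re.IGNORECASE is kept from the question so that both "The" and "the" match.

```python
import re

# Shortened sample based on the question's text; the line breaks inside
# the phrases are kept on purpose to show that \s+ spans them.
sample = """1. The thirty-fifth session of the Subsidiary Body for Implementation (SBI) was held
in Durban. 10. Forum at the thirty-fourth and thirty-fifth sessions of the
subsidiary bodies, with a view to adopting, at the seventeenth session of the
Conference of the Parties, modalities for the work program."""

pattern = re.compile(
    r"The\s+\S+(?:\s+and\s+\S+)?\s+sessions?\s+of\s+the\s+"
    r"(?:Subsidiary\s+Body\s+for\s+Implementation"
    r"|Conference\s+of\s+the\s+Parties"
    r"|subsidiary\s+bodies)",
    re.IGNORECASE,
)

# The pattern has no capturing groups, so findall returns the full matches.
# Collapse any run of whitespace (including newlines) to a single space.
matches = [" ".join(m.split()) for m in pattern.findall(sample)]
print(matches)
```

On this sample it yields the three phrases the question asks for. Note that the pattern has no word boundary before "The", so with re.IGNORECASE it could in principle start inside a longer word; adding \b in front is one possible tightening.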