How does the regex pattern '(?<=.)(?=[A-Z])' work?

Time:11-24

I came across a problem that requires splitting a string into its capitalized words. I saw that some solutions use re.split() to split the string just before each capital letter.

Example:

Input is:

>>> x = 'TheLongAndWindingRoad' 

Expected output:

['The', 'Long', 'And', 'Winding', 'Road']

I came across a post using

>>> import re
>>> re.split('(?<=.)(?=[A-Z])', 'TheLongAndWindingRoad')
['The', 'Long', 'And', 'Winding', 'Road']

The code worked well for me, and I was wondering how they came up with

'(?<=.)(?=[A-Z])' 

CodePudding user response:

If it were me, it would be trial and error. Following the rules in the Python documentation for the re module ("Regular expression operations"), I'd start by splitting on the capital letters themselves:

>>> re.split(r"[A-Z]", x)
['', 'he', 'ong', 'nd', 'inding', 'oad']

But that's not right: the capital letters are consumed as separators. I want to split right before each capital letter without consuming it, and that means a lookahead, (?=...):

>>> re.split(r"(?=[A-Z])", x)
['', 'The', 'Long', 'And', 'Winding', 'Road']

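But that's still not right. How do I avoid the empty string at the start? Don't split at the very beginning of the string, where there is nothing before the capital letter. That means requiring at least one character before the split point, with a lookbehind (?<=...) for any character: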

>>> re.split(r"(?<=.)(?=[A-Z])", x)
['The', 'Long', 'And', 'Winding', 'Road']
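Before trusting the final pattern, I'd probably try it on a few other inputs. This is only a sketch: the helper name split_camel is made up for illustration, and it assumes Python 3.7 or later, where re.split() can split on a pattern that matches an empty string.

```python
import re

def split_camel(text):
    # Hypothetical helper: split before every ASCII capital letter
    # that has at least one character in front of it.
    return re.split(r"(?<=.)(?=[A-Z])", text)

print(split_camel("TheLongAndWindingRoad"))  # ['The', 'Long', 'And', 'Winding', 'Road']
print(split_camel("lowerCamel"))             # ['lower', 'Camel']
print(split_camel("HTTPServer"))             # ['H', 'T', 'T', 'P', 'Server']
print(split_camel(""))                       # ['']
```

Note that every capital letter starts a new piece, so runs of capitals such as acronyms are broken into single letters.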

And then I'd realize this only works for ASCII capitals: [A-Z] does not match uppercase letters outside that range, such as É.
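If non-ASCII capitals matter, one option that avoids regular expressions entirely is to walk the string and use str.isupper(), which is Unicode-aware. This is a rough sketch, not the approach from the original post; the function name split_before_uppercase and the sample string are my own.

```python
def split_before_uppercase(text):
    # Hypothetical helper: split before every uppercase character
    # except one at the very start of the string.
    parts = []
    start = 0
    for i, ch in enumerate(text):
        if i > 0 and ch.isupper():
            parts.append(text[start:i])
            start = i
    parts.append(text[start:])
    return parts

print(split_before_uppercase("TheLongAndWindingRoad"))  # ['The', 'Long', 'And', 'Winding', 'Road']
print(split_before_uppercase("ÉcoleÀParis"))            # ['École', 'À', 'Paris']
```

With the regex version, the second string would only be split before the P, because É and À are not in [A-Z].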

CodePudding user response:

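(?<=...) is a lookbehind. (?=...) is a lookahead. Both are zero-width: they check the text around a position without consuming any characters.

So the string will be split at every empty position (a place between two characters) that has any character immediately before it (.) and a capital letter immediately after it ([A-Z]).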

It seems like (?=[A-Z]) would suffice (split in front of capital letters), but that also matches at the very start of the string and will leave you with an empty string in front of The.
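One way to see this is to list where each pattern matches. The sketch below uses re.finditer() and prints the (start, end) of every match; the variable names are just for illustration.

```python
import re

x = 'TheLongAndWindingRoad'

# Each match is zero-width, so start == end: it marks a position, not a character.
lookahead_only = [(m.start(), m.end()) for m in re.finditer(r"(?=[A-Z])", x)]
with_lookbehind = [(m.start(), m.end()) for m in re.finditer(r"(?<=.)(?=[A-Z])", x)]

print(lookahead_only)   # [(0, 0), (3, 3), (7, 7), (10, 10), (17, 17)]
print(with_lookbehind)  # [(3, 3), (7, 7), (10, 10), (17, 17)]
```

The only difference is the match at position 0, which is the split that produces the leading empty string.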
