I ran into a problem where I need to split a string into its capitalized words. I saw that some people use re.split() to split at the boundary before each capital letter.
Example:
Input is:
>>> x = 'TheLongAndWindingRoad'
Expected output:
['The', 'Long', 'And', 'Winding', 'Road']
I came across a post using
>>> re.split('(?<=.)(?=[A-Z])', 'TheLongAndWindingRoad')
['The', 'Long', 'And', 'Winding', 'Road']
The code worked well for me, but I was wondering: how did they come up with
'(?<=.)(?=[A-Z])'?
CodePudding user response:
If it were me, it would be trial and error. Following the rules in the Regular expression operations documentation, I'd start by splitting on capital letters:
>>> import re
>>> re.split(r"[A-Z]", x)
['', 'he', 'ong', 'nd', 'inding', 'oad']
But that's not right: the capital letters themselves are consumed as separators. I want to split right before each letter without consuming it, which means a lookahead, (?=...):
>>> re.split(r"(?=[A-Z])", x)
['', 'The', 'Long', 'And', 'Winding', 'Road']
But that's still not right. How do I avoid the empty string at the start? Don't split at the very beginning of the string, that is, only split where at least one character comes before the position. That means a lookbehind, (?<=...), for any character:
>>> re.split(r"(?<=.)(?=[A-Z])", x)
['The', 'Long', 'And', 'Winding', 'Road']
And then I'd realize this only works for ASCII, because [A-Z] does not match capitals such as É.
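To illustrate the ASCII limitation, here is a minimal sketch of one possible workaround that does not use a regex at all. The helper name split_before_uppercase is made up for this example; it relies on str.isupper(), which is Unicode-aware, and it is only one way to approach this, not what the original post used.

```python
def split_before_uppercase(text):
    """Split text before every uppercase character except the first one.

    Hypothetical helper for illustration; uses str.isupper(), which
    recognizes non-ASCII capitals such as 'É' and 'À'.
    """
    parts = []
    start = 0
    for i, ch in enumerate(text):
        # i > 0 plays the role of the (?<=.) lookbehind: never split at the start.
        if i > 0 and ch.isupper():
            parts.append(text[start:i])
            start = i
    parts.append(text[start:])
    return parts


print(split_before_uppercase("TheLongAndWindingRoad"))
# ['The', 'Long', 'And', 'Winding', 'Road']
print(split_before_uppercase("ÉcoleÀParis"))
# ['École', 'À', 'Paris']
```

For comparison, re.split(r"(?<=.)(?=[A-Z])", "ÉcoleÀParis") only splits before the P, because À is not in [A-Z].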
CodePudding user response:
?<= is a lookbehind. ?= is a lookahead.
So the string will be split at each empty position (a place between two characters) that is preceded by any character (.) and followed by a capital letter ([A-Z]).
It seems like (?=[A-Z]) would suffice (split in front of capital letters), but that will leave you with an empty string in front of 'The'.
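One way to see the difference is to list the positions where each pattern matches. The sketch below uses re.finditer() for that; the variable names are just for this example.

```python
import re

x = "TheLongAndWindingRoad"

# Each match is zero-width, so m.start() is the split position itself.
lookahead_only = [m.start() for m in re.finditer(r"(?=[A-Z])", x)]
with_lookbehind = [m.start() for m in re.finditer(r"(?<=.)(?=[A-Z])", x)]

print(lookahead_only)   # [0, 3, 7, 10, 17]
print(with_lookbehind)  # [3, 7, 10, 17]
```

Position 0 is the one that produces the empty string: nothing precedes it, so the lookbehind (?<=.) rejects it. Note that re.split() only supports splitting on empty matches like these since Python 3.7.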