Split a string and capture all instances in python regex

Time:07-07

Newbie here. I have been trying to learn regex for some time, but sometimes I feel I don't understand how regex handles strings: in the planning phase I seem to work it out, but the implementation doesn't behave as I expect.

Here is my little problem: I have strings that contain one or more names (team names). The problem is that if a string contains more than one name, there is no separator; the names are joined directly.

Some examples :

String -> Contains -> Names to be extracted

  • 'RangersIslandersDevils' -> 3 names -> [Rangers, Islanders, Devils]
  • '49ersRaiders' -> 2 names -> [49ers, Raiders]
  • 'Avalanche' -> 1 name -> [Avalanche]
  • 'Red Wings' -> 1 name -> [Red Wings]

I want to capture each name in each string and use them in a loop later on. But I can't seem to implement the pattern I imagine for it.

The pattern I have in my head for these strings goes like this:

  1. Start scanning the text, which is expected to start with a capital letter or a number.
  2. If you see a literal 's' followed by a capital letter (like ...s[A-Z]...), capture the text up to and including that 's'.
  3. Repeat step 2 until the ...s[A-Z]... pattern no longer appears, then capture the rest of the string as the last name.
  4. Optionally, write all the names to a list.

I tried some code in vain: step 2 captures only one instance, and step 3 gives just one more.

re.findall(r'([A-Z0-9].*s)*([A-Z].*)', 'RangersIslandersMolsDevil')

That returns only two names:

[('RangersIslandersMols', 'Devil')]

whereas I want four:

[Rangers, Islanders, Mols, Devil]

CodePudding user response:

([A-Z0-9].*s)* is greedy: the .* matches as many characters as it can, running up to the last 's' in the string, and that is what causes 'RangersIslandersMols' to get stuck together as one match.
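
To see the greediness in isolation, here is a small sketch that compares the greedy piece of the pattern with a lazy variant. The lazy pattern and the variable names are only illustrative; they are not from the question.

```python
import re

text = 'RangersIslandersMolsDevil'

# Greedy: .* runs to the end of the string, then backtracks only as far
# as the LAST 's', so several names end up in a single match.
greedy = re.findall(r'[A-Z0-9].*s', text)

# Lazy: .*? grows one character at a time and stops at the first 's'
# that is directly followed by a capital letter.
lazy = re.findall(r'[A-Z0-9].*?s(?=[A-Z])', text)

print(greedy)  # ['RangersIslandersMols']
print(lazy)    # ['Rangers', 'Islanders', 'Mols']
```

Even the lazy variant misses 'Devil', because that name does not end in 's'. That is a reason to key on letter case rather than on the letter 's'.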

It sounds like the boundary between team names is defined as a lowercase letter (not necessarily an 's', as in 'Avalanche') followed immediately by an uppercase letter or number, so our pattern should look for:

  • uppercase letters or numbers, followed by
  • lowercase letters

Because a team name can have multiple words, we'll also look for a space followed by the same pattern as above, for any possible number of words.

Try this pattern:

>>> import re
>>> pattern = r'[A-Z0-9]+[a-z]+(?: [A-Z0-9]+[a-z]+)*'
>>> re.findall(pattern, "RangersIslandersDevils49ersWashginton Football TeamAvalancheWarriors")
['Rangers', 'Islanders', 'Devils', '49ers', 'Washginton Football Team', 'Avalanche', 'Warriors']
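
Since the goal is to loop over the names afterwards, a pattern of this shape could be wrapped in a small helper. This is only a sketch: split_team_names and TEAM_NAME are names made up here, and it assumes every name is built from words that start with capitals or digits and continue in lowercase, so a name with a different shape would not come back whole.

```python
import re

# One or more capitals/digits, then lowercase letters; optionally followed by
# more space-separated words of the same shape (for names like 'Red Wings').
TEAM_NAME = re.compile(r'[A-Z0-9]+[a-z]+(?: [A-Z0-9]+[a-z]+)*')

def split_team_names(joined):
    """Return the list of team names found in a separator-less string."""
    return TEAM_NAME.findall(joined)

for s in ['RangersIslandersDevils', '49ersRaiders', 'Avalanche', 'Red Wings']:
    for name in split_team_names(s):
        print(name)
```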

CodePudding user response:

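I think this is being overcomplicated. Why not just split the string in front of every capital letter? The pattern below matches any single character followed by a run of characters that are not capital letters. Note that it also splits multi-word names such as 'Red Wings'.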

import re

test = [
    'RangersIslandersDevils',
    '49ersRaiders',
    'Avalanche',
    'Red Wings',
    'RangersIslandersMolsDevil'
]

for word in test:
    print(re.findall(r'.[^A-Z]*', word))

Output:

['Rangers', 'Islanders', 'Devils']
['49ers', 'Raiders']
['Avalanche']
['Red ', 'Wings']
['Rangers', 'Islanders', 'Mols', 'Devil']
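
As the 'Red Wings' line shows, this approach breaks multi-word names apart. If that matters, one possible variation (not from the answer above) is to split only where a lowercase letter is directly followed by a capital letter or digit. A sketch, assuming Python 3.7 or later, where re.split accepts patterns that match an empty string; split_names is a hypothetical helper name:

```python
import re

def split_names(joined):
    # Split at the empty position between a lowercase letter and a
    # capital letter or digit; a space before a capital is not a boundary.
    return re.split(r'(?<=[a-z])(?=[A-Z0-9])', joined)

for word in ['RangersIslandersDevils', '49ersRaiders', 'Avalanche', 'Red Wings']:
    print(split_names(word))
```

With this variation 'Red Wings' stays in one piece, because the 'd' is followed by a space rather than by a capital letter.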