Joining uneven strings conditionally using list comprehension

I am using the following code to join the two-letter abbreviation, the first name, and the last name, so that every three words become one phrase. However, in some cases the name contains a "Jr", as in the sample below, which breaks this otherwise working list comprehension.

span = 3
# sample input
words = ['QB', 'Teddy', 'Bridgewater', 'RB', 'Dalvin', 'Cook', 'WR', 'Keenan', 'Allen', 'TE', 'Dalton', 'Schultz', 'WR', 'Odell', 'Beckham', 'Jr']
# join every `span` consecutive words into one phrase
item = [" ".join(words[i:i + span]) for i in range(0, len(words), span)]

Is there a way to conditionally join a span of 4 when "Jr" is the 4th word, as in this case? What I currently get:

['QB Teddy Bridgewater', 'RB Dalvin Cook', 'WR Keenan Allen', 'TE Dalton Schultz', 'WR Odell Beckham', 'Jr']

But the expected output is:

['QB Teddy Bridgewater', 'RB Dalvin Cook', 'WR Keenan Allen', 'TE Dalton Schultz', 'WR Odell Beckham Jr']

CodePudding user response：

One solution is to use a regular expression:

import re

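# match a two-letter uppercase abbreviation, then lazily everything up to the next abbreviation or the end of the string
print(re.findall(r"([A-Z]{2}.+?)\s*(?=[A-Z]{2}|\Z)", " ".join(words)))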

Prints:

['QB Teddy Bridgewater', 'RB Dalvin Cook', 'WR Keenan Allen', 'TE Dalton Schultz', 'WR Odell Beckham Jr']
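If you would rather keep the original span-based approach, an explicit loop can extend a group by one word whenever the next word is a name suffix. This is only a sketch: the `SUFFIXES` set and the `group_players` function are names made up for illustration, and it assumes every player otherwise takes exactly `span` words.

```python
# Hypothetical set of name suffixes; extend it to match your data.
SUFFIXES = {"Jr", "Jr.", "Sr", "Sr."}


def group_players(words, span=3, suffixes=SUFFIXES):
    """Join words in groups of `span`, adding one more word when it is a suffix."""
    items = []
    i = 0
    while i < len(words):
        end = i + span
        # Extend the group if the word right after it is a suffix such as "Jr".
        if end < len(words) and words[end] in suffixes:
            end += 1
        items.append(" ".join(words[i:end]))
        i = end
    return items


words = ['QB', 'Teddy', 'Bridgewater', 'RB', 'Dalvin', 'Cook',
         'WR', 'Keenan', 'Allen', 'TE', 'Dalton', 'Schultz',
         'WR', 'Odell', 'Beckham', 'Jr']
print(group_players(words))
# ['QB Teddy Bridgewater', 'RB Dalvin Cook', 'WR Keenan Allen', 'TE Dalton Schultz', 'WR Odell Beckham Jr']
```

Unlike the regex, this does not depend on the abbreviations being uppercase, but it does break if a name has more or fewer than two words.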

CodePudding user response：

Assuming that each group starts with a two-letter uppercase abbreviation, you can join all the strings together, prefixing each abbreviation with a newline and every other word with a space. Then split the resulting string on newlines and drop the first (empty) entry.

words = ['QB', 'Teddy', 'Bridgewater','RB', 'Dalvin', 'Cook', 'WR', 'Keenan', 'Allen', 'TE', 'Dalton', 'Schultz', 'WR', 'Odell', 'Beckham', 'Jr']

# " \n"[False] is a space, " \n"[True] is a newline
items = "".join(" \n"[s[:2]==s.upper()] + s for s in words).split("\n")[1:]

print(items)
['QB Teddy Bridgewater', 'RB Dalvin Cook', 'WR Keenan Allen', 'TE Dalton Schultz', 'WR Odell Beckham Jr']

If the list of abbreviations is known, it would be better to replace [s[:2]==s.upper()] with membership in a set: [s in {'QB','RB','WR','TE'}] (you should hold the set in a separate variable, though).
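As a rough sketch of the set-based variant, with the set held in its own variable (the name `POSITIONS` is just an illustrative choice, and the four abbreviations are the ones from the sample):

```python
# Assumed list of position abbreviations; adjust to your data.
POSITIONS = {"QB", "RB", "WR", "TE"}

words = ['QB', 'Teddy', 'Bridgewater', 'RB', 'Dalvin', 'Cook',
         'WR', 'Keenan', 'Allen', 'TE', 'Dalton', 'Schultz',
         'WR', 'Odell', 'Beckham', 'Jr']

# Prefix known abbreviations with a newline and every other word with a space,
# then split on newlines and drop the leading empty entry.
items = "".join(" \n"[s in POSITIONS] + s for s in words).split("\n")[1:]

print(items)
# ['QB Teddy Bridgewater', 'RB Dalvin Cook', 'WR Keenan Allen', 'TE Dalton Schultz', 'WR Odell Beckham Jr']
```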

If you don't mind importing the re module, you can do the same thing more concisely by using a regular expression to replace the relevant spaces with newlines:

import re

# replace each space that precedes a two-letter uppercase word with a newline
items = re.sub(r" (?=[A-Z]{2} )","\n"," ".join(words)).split("\n")

This may be somewhat unreliable, since any name containing a two-letter uppercase word would cause an inappropriate split (e.g. ["RB", "OJ", "Simpson"]). With a known list of abbreviations, this can be avoided by placing them in the pattern:

items = re.sub(r" (?=(QB|RB|WR|TE) )","\n"," ".join(words)).split("\n")
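To make the difference concrete, here is a small sketch that runs both patterns on the OJ example. The variable names `generic` and `known` are only for illustration.

```python
import re

words = ["RB", "OJ", "Simpson", "WR", "Odell", "Beckham", "Jr"]
text = " ".join(words)

# Generic pattern: splits before any two-letter uppercase word, including "OJ".
generic = re.sub(r" (?=[A-Z]{2} )", "\n", text).split("\n")

# Known abbreviations only: "OJ" is no longer treated as a group start.
known = re.sub(r" (?=(QB|RB|WR|TE) )", "\n", text).split("\n")

print(generic)  # ['RB', 'OJ Simpson', 'WR Odell Beckham Jr']
print(known)    # ['RB OJ Simpson', 'WR Odell Beckham Jr']
```

The second pattern would still split in the wrong place if a name happened to contain one of the listed abbreviations as a separate word.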