I am using the following code to join the two-letter abbreviation, the first name, and the last name, so that every 3 words become a phrase. However, there are cases where the name contains a "Jr", as in the sample below, which breaks this otherwise working list comprehension.
span = 3
# sample list
words = ['QB', 'Teddy', 'Bridgewater', 'RB', 'Dalvin', 'Cook', 'WR', 'Keenan', 'Allen', 'TE', 'Dalton', 'Schultz', 'WR', 'Odell', 'Beckham', 'Jr']
item = [" ".join(words[i:i + span]) for i in range(0, len(words), span)]
Is there a way to conditionally join a span of 4 when "Jr" is the 4th word, as in this case? What I currently get:
['QB Teddy Bridgewater', 'RB Dalvin Cook', 'WR Keenan Allen', 'TE Dalton Schultz', 'WR Odell Beckham', 'Jr']
But the expected output is:
['QB Teddy Bridgewater', 'RB Dalvin Cook', 'WR Keenan Allen', 'TE Dalton Schultz', 'WR Odell Beckham Jr']
CodePudding user response:
One solution is to use a regular expression:
import re
print(re.findall(r"([A-Z]{2}.+?)\s*(?=[A-Z]{2}|\Z)", " ".join(words)))
Prints:
['QB Teddy Bridgewater', 'RB Dalvin Cook', 'WR Keenan Allen', 'TE Dalton Schultz', 'WR Odell Beckham Jr']
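To make the pattern easier to read, here is a sketch of the same idea written with re.VERBOSE so each part can carry a comment. The shortened words list is only for illustration, and the pattern assumes that no name contains two consecutive uppercase letters:

```python
import re

# Shortened sample list, for illustration only.
words = ['QB', 'Teddy', 'Bridgewater', 'RB', 'Dalvin', 'Cook',
         'WR', 'Odell', 'Beckham', 'Jr']

pattern = re.compile(r"""
    ([A-Z]{2}.+?)      # capture a two-letter uppercase code plus as little text as possible
    \s*                # skip the whitespace that separates two groups
    (?=[A-Z]{2}|\Z)    # stop right before the next two-letter code, or at the end
""", re.VERBOSE)

groups = pattern.findall(" ".join(words))
print(groups)
# ['QB Teddy Bridgewater', 'RB Dalvin Cook', 'WR Odell Beckham Jr']
```

"Jr" does not start a new group because the lookahead needs two uppercase letters in a row, and the "r" is lowercase.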
CodePudding user response:
Assuming that each group starts with an uppercase abbreviation, you can join all the strings together, prefixing the abbreviations with a newline and the other words with a space. Then split the resulting string on newlines and drop the first (empty) entry.
words = ['QB', 'Teddy', 'Bridgewater','RB', 'Dalvin', 'Cook', 'WR', 'Keenan', 'Allen', 'TE', 'Dalton', 'Schultz', 'WR', 'Odell', 'Beckham', 'Jr']
items = "".join(" \n"[s[:2]==s.upper()] + s for s in words).split("\n")[1:]
print(items)
['QB Teddy Bridgewater', 'RB Dalvin Cook', 'WR Keenan Allen', 'TE Dalton Schultz', 'WR Odell Beckham Jr']
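If the one-liner is hard to follow, the same steps can be spelled out one at a time. This is only an unrolled sketch of the approach above; the intermediate names (prefixed, joined) and the shortened list are made up for illustration:

```python
# Shortened sample list, for illustration only.
words = ['QB', 'Teddy', 'Bridgewater', 'WR', 'Odell', 'Beckham', 'Jr']

# Step 1: pick a prefix for each word. Indexing the two-character string " \n"
# with a bool gives " " for False (index 0) and "\n" for True (index 1).
# The test is True only for words of at most two characters that are all uppercase.
prefixed = [" \n"[s[:2] == s.upper()] + s for s in words]

# Step 2: concatenate everything into a single string.
joined = "".join(prefixed)

# Step 3: split on newlines. The string starts with "\n", so the first
# entry is empty and gets dropped.
items = joined.split("\n")[1:]

print(prefixed)  # ['\nQB', ' Teddy', ' Bridgewater', '\nWR', ' Odell', ' Beckham', ' Jr']
print(items)     # ['QB Teddy Bridgewater', 'WR Odell Beckham Jr']
```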
If the list of abbreviations is known, it would be better to replace [s[:2]==s.upper()] with membership in a set: [s in {'QB','RB','WR','TE'}] (you should hold the set in a separate variable, though).
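A minimal sketch of that variant, with the set held in its own variable. The name POSITIONS is made up here, and the four abbreviations are assumed to be the complete list:

```python
# Assumed complete set of position abbreviations (hypothetical name).
POSITIONS = {'QB', 'RB', 'WR', 'TE'}

words = ['QB', 'Teddy', 'Bridgewater', 'RB', 'Dalvin', 'Cook', 'WR', 'Keenan', 'Allen',
         'TE', 'Dalton', 'Schultz', 'WR', 'Odell', 'Beckham', 'Jr']

# Words in the set get a newline prefix (start of a group), all others a space.
items = "".join(" \n"[s in POSITIONS] + s for s in words).split("\n")[1:]
print(items)
# ['QB Teddy Bridgewater', 'RB Dalvin Cook', 'WR Keenan Allen', 'TE Dalton Schultz', 'WR Odell Beckham Jr']
```

Building the set once, outside the comprehension, avoids writing the set literal inside the loop expression and keeps the list of abbreviations in one place.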
If you don't mind importing the re module, you can do the same thing more concisely by using a regular expression to replace the space before each abbreviation with a newline:
import re
items = re.sub(r" (?=[A-Z]{2} )","\n"," ".join(words)).split("\n")
This may be somewhat unreliable, because any name containing a two-letter uppercase word would cause an inappropriate split (e.g. ["RB", "OJ", "Simpson"]). With a known list of abbreviations, this can be avoided by placing them in the pattern:
items = re.sub(r" (?=(QB|RB|WR|TE) )","\n"," ".join(words)).split("\n")
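To see the difference, here is a small comparison of the two patterns on a list that contains "OJ". The helper group_players and the POSITIONS list are hypothetical names for this sketch, and the list is assumed to cover every abbreviation that can appear:

```python
import re

# Assumed complete list of position abbreviations (hypothetical name).
POSITIONS = ['QB', 'RB', 'WR', 'TE']

def group_players(words, positions=POSITIONS):
    """Hypothetical helper: split only before a known abbreviation."""
    alternatives = "|".join(map(re.escape, positions))
    pattern = rf" (?=(?:{alternatives}) )"
    return re.sub(pattern, "\n", " ".join(words)).split("\n")

words = ['RB', 'OJ', 'Simpson', 'WR', 'Odell', 'Beckham', 'Jr']

# Generic pattern: any two uppercase letters trigger a split, including "OJ".
generic = re.sub(r" (?=[A-Z]{2} )", "\n", " ".join(words)).split("\n")
print(generic)               # ['RB', 'OJ Simpson', 'WR Odell Beckham Jr']

# Known abbreviations only: "OJ" stays with the name.
print(group_players(words))  # ['RB OJ Simpson', 'WR Odell Beckham Jr']
```

Generating the alternation from the list means a new abbreviation only has to be added in one place.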