I have a list of strings, names:
names = ['acquaintance Muller', 'Vice president Johnson affiliate Peterson acquaintance Dr. Rose']
I want to split the strings that contain more than one of the following substrings:
substrings = ['Vice president', 'affiliate', 'acquaintance']
More precisely, I want to split after the last character of the word that follows the substring:
desired_output = ['acquaintance Muller', 'Vice president Johnson', 'affiliate Peterson', 'acquaintance Dr. Rose']
I don't know how to implement the 'more than one' condition in my code:
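import re

names = ['acquaintance Muller', 'Vice president Johnson affiliate Peterson acquaintance Dr. Rose']
substrings = re.compile(r'Vice\spresident|affiliate|acquaintance')
splitted = []
for i in names:
    # A compiled pattern cannot be used with `in`; use .search() instead
    if substrings.search(i):
        splitted.append([])
        # This only groups whole strings; it does not split them yet
        splitted[-1].append(i)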
Exception: when that last character is a period (e.g. Prof.), split after the second word following the substring.
As a list comprehension:
[x for x in names if 'Vice president' in x or 'affiliate' in x or 'acquaintance' in x]
CodePudding user response:
Try:
import re

names = [
    "acquaintance Muller",
    "Vice president Johnson affiliate Peterson acquaintance Dr. Rose",
]
substrings = ["Vice president", "affiliate", "acquaintance"]

# One pattern that matches any of the substrings literally
r = re.compile("|".join(map(re.escape, substrings)))

out = []
for n in names:
    # Start index of every substring found in this name
    starts = [i.start() for i in r.finditer(n)]
    if not starts:
        out.append(n)
        continue
    # Keep any text that precedes the first substring
    if starts[0] != 0:
        starts = [0, *starts]
    starts.append(len(n))
    # Cut between consecutive start positions and drop surrounding spaces
    for a, b in zip(starts, starts[1:]):
        out.append(n[a:b].strip())

print(out)
Prints:
['acquaintance Muller', 'Vice president Johnson', 'affiliate Peterson', 'acquaintance Dr. Rose']
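The loop above cuts each string just before the next title. If the rule from the question should be followed literally (one word after the title, or two words when the first ends in a period), a pattern that matches each title together with the words it governs may be a more direct route. This is only a sketch: the function name split_by_rule is made up for illustration, and it assumes a "word" is any run of non-whitespace characters.

```python
import re


def split_by_rule(names, substrings):
    """Hypothetical helper: return each title plus the word(s) that follow it."""
    titles = "|".join(map(re.escape, substrings))
    # Title, then optionally one word ending in a period (e.g. "Dr."),
    # then exactly one more word.
    pattern = re.compile(rf"(?:{titles})\s+(?:\S+\.\s+)?\S+")
    # findall returns the full match because all groups are non-capturing.
    return [m for n in names for m in pattern.findall(n)]


names = [
    "acquaintance Muller",
    "Vice president Johnson affiliate Peterson acquaintance Dr. Rose",
]
substrings = ["Vice president", "affiliate", "acquaintance"]

print(split_by_rule(names, substrings))
```

Note that, unlike the loop above, this sketch drops any text that is not attached to one of the titles, so it only fits if every part of the string starts with a known title.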
CodePudding user response:
You want to split at the word boundary just before one of those three titles, so you can look for a word boundary \b followed by a positive lookahead (?=...) for one of those titles:
>>> import re
>>> s = 'Vice president Johnson affiliate Peterson acquaintance Dr. Rose'
>>> v = re.split(r"\b(?=Vice president|affiliate|acquaintance)", s)
>>> v
['', 'Vice president Johnson ', 'affiliate Peterson ', 'acquaintance Dr. Rose']
Then, you can trim and discard the empty results:
>>> v = [x for i in v if (x := i.strip())]
>>> v
['Vice president Johnson', 'affiliate Peterson', 'acquaintance Dr. Rose']
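To apply the same idea to the whole names list from the question rather than a single string, the split and the clean-up can be wrapped in a small function. The following is a rough sketch, not a definitive implementation: the name split_before_titles is invented here, the pattern is built from the substrings list with re.escape, and it assumes Python 3.7 or later, where re.split accepts patterns that match an empty string.

```python
import re


def split_before_titles(names, substrings):
    """Hypothetical helper: split every name just before each title."""
    titles = "|".join(map(re.escape, substrings))
    # Zero-width match: a word boundary immediately followed by a title.
    splitter = re.compile(rf"\b(?={titles})")
    out = []
    for name in names:
        for part in splitter.split(name):
            part = part.strip()
            if part:  # skip the empty piece produced by a match at index 0
                out.append(part)
    return out


names = [
    "acquaintance Muller",
    "Vice president Johnson affiliate Peterson acquaintance Dr. Rose",
]
substrings = ["Vice president", "affiliate", "acquaintance"]

print(split_before_titles(names, substrings))
```

Because the split happens before each title rather than after a fixed number of words, "Dr. Rose" stays together without any special handling of the period.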