Split strings that contain more than one substring

I have a list of strings, names:

names = ['acquaintance Muller', 'Vice president Johnson affiliate Peterson acquaintance Dr. Rose']

I want to split the strings that contain more than one of the following substrings:

substrings = ['Vice president', 'affiliate', 'acquaintance']

More precisely, I want to split after the last character of the word that follows the substring:

desired_output = ['acquaintance Muller', 'Vice president Johnson', 'affiliate Peterson', 'acquaintance Dr. Rose']

I don't know how to implement the 'more than one' condition in my code:

import re

names = ['acquaintance Muller', 'Vice president Johnson affiliate Peterson acquaintance Dr. Rose']
substrings = re.compile(r'Vice\spresident|affiliate|acquaintance')
splitted = []
for i in names:
    # a compiled pattern cannot be tested with `in`; use .search() instead
    if substrings.search(i):
        splitted.append([])
    # this only groups whole strings, it does not split them yet
    splitted[-1].append(i)

Exception: when the word following the substring ends in a period (e.g. Dr. or Prof.), split after the second word following the substring.

List comprehension attempt (this only filters the strings, it does not split them):

[x for x in names if 'Vice president' in x or 'affiliate' in x or 'acquaintance' in x]

CodePudding user response：

Try:

import re

names = [
    "acquaintance Muller",
    "Vice president Johnson affiliate Peterson acquaintance Dr. Rose",
]
substrings = ["Vice president", "affiliate", "acquaintance"]

r = re.compile("|".join(map(re.escape, substrings)))

out = []
for n in names:
    starts = [i.start() for i in r.finditer(n)]

    if not starts:
        out.append(n)
        continue

    if starts[0] != 0:
        starts = [0, *starts]

    starts.append(len(n))
    for a, b in zip(starts, starts[1:]):
        # strip the space left over before the next title
        out.append(n[a:b].strip())

print(out)

Prints:

['acquaintance Muller', 'Vice president Johnson', 'affiliate Peterson', 'acquaintance Dr. Rose']
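If you really only want to touch the strings that contain more than one of the substrings, as the question asks, you could count the matches first and leave everything else alone. A minimal sketch of that idea; the helper name `split_on_titles` is made up for illustration:

```python
import re


def split_on_titles(names, substrings):
    """Split a name before each title, but only if it holds more than one title."""
    pattern = re.compile("|".join(map(re.escape, substrings)))
    out = []
    for name in names:
        starts = [m.start() for m in pattern.finditer(name)]
        if len(starts) < 2:
            # zero or one title: keep the string unchanged
            out.append(name)
            continue
        # cut points: start of string, start of every title, end of string
        bounds = [0, *starts, len(name)]
        for a, b in zip(bounds, bounds[1:]):
            piece = name[a:b].strip()
            if piece:  # drops the empty piece when the first title is at index 0
                out.append(piece)
    return out


names = [
    "acquaintance Muller",
    "Vice president Johnson affiliate Peterson acquaintance Dr. Rose",
]
substrings = ["Vice president", "affiliate", "acquaintance"]
result = split_on_titles(names, substrings)
```

With the sample data this should give the desired_output from the question. A string with text before a single title, such as "old acquaintance Muller", is kept whole because it contains only one match.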

CodePudding user response：

You want to split at the word boundary just before one of those three titles, so you can look for a word boundary \b followed by a positive lookahead (?=...) for one of those titles. Note that splitting on a zero-width match like this requires Python 3.7 or later:

>>> import re
>>> s = 'Vice president Johnson affiliate Peterson acquaintance Dr. Rose'
>>> v = re.split(r"\b(?=Vice president|affiliate|acquaintance)", s)
>>> v
['', 'Vice president Johnson ', 'affiliate Peterson ', 'acquaintance Dr. Rose']

Then, you can trim and discard the empty results:

>>> v = [x for i in v if (x := i.strip())]
>>> v
['Vice president Johnson', 'affiliate Peterson', 'acquaintance Dr. Rose']
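To run this over the whole names list from the question, the split and the clean-up can be combined into one pass. This is only a sketch: `TITLES`, `SPLIT_AT` and `split_names` are illustrative names, and the pattern is built from the list with re.escape instead of being typed by hand:

```python
import re

TITLES = ["Vice president", "affiliate", "acquaintance"]

# Zero-width split point: a word boundary directly before any title.
# Splitting on empty matches needs Python 3.7+.
SPLIT_AT = re.compile(r"\b(?=" + "|".join(map(re.escape, TITLES)) + ")")


def split_names(names):
    """Split every string before each title, then trim and drop empty pieces."""
    return [
        part.strip()
        for name in names
        for part in SPLIT_AT.split(name)
        if part.strip()
    ]


names = [
    "acquaintance Muller",
    "Vice president Johnson affiliate Peterson acquaintance Dr. Rose",
]
result = split_names(names)
```

Because the split happens before each title rather than after a fixed number of words, a name like 'Dr. Rose' stays in one piece without any special handling for the period.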