Splitting strings by a list of separators irrespective of order

Time:04-15

I have a string text and a list names

  • I want to split text every time an element of names occurs.

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'

names = ['Mike', 'Monika']

desired output:

output = [['Monika', ' goes shopping. Then she rides bike.'], ['Mike', ' likes Pizza.'], ['Monika', ' hates me.']]

FAQ

  • The order of the separators within names is independent of their occurrence in text.
  • Separators within names are unique but can occur multiple times throughout text. Therefore the output can have more lists than names has strings.
  • text will never have the same names element occurring twice consecutively.
  • Ultimately I want the output to be a list of lists where each slice of text is paired with the separator it was split by. The order of the lists doesn't matter.

re.split() won't let me use a list as a separator argument. Can I re.compile() my separator list?


help:

I think somebody has already had a similar problem here: https://stackoverflow.com/a/4697047/14648054

def split(txt, seps):
    default_sep = seps[0]
    for sep in seps[1:]: # skip seps[0] as the default separator
        txt = txt.replace(sep, default_sep)
    return [i.strip() for i in txt.split(default_sep)]

and here: https://stackoverflow.com/a/2911664/14648054

def my_split(s, seps):
    res = [s]
    for sep in seps:
        s, res = res, []
        for seq in s:
            res += seq.split(sep)
    return res

print(my_split('1111  2222 3333;4444,5555;6666', [' ', ';', ',']))
['1111', '', '2222', '3333', '4444', '5555', '6666']
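
Neither helper fits this case as-is, because plain str.split() throws the separators away, so the result no longer says which name each slice belonged to. A quick check with the first helper on the text and names from the question (a sketch for illustration only):

```python
def split(txt, seps):
    # Same helper as quoted above: map every separator onto the first one, then split.
    default_sep = seps[0]
    for sep in seps[1:]:
        txt = txt.replace(sep, default_sep)
    return [i.strip() for i in txt.split(default_sep)]

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']

result = split(text, names)
print(result)
# ['', 'goes shopping. Then she rides bike.', 'likes Pizza.', 'hates me.']
```

The slices are there, but the names are gone.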

CodePudding user response:

Your example doesn't fully match your desired output. Also, it's not clear whether the example input will always have this structure, e.g. with a period at the end of each sentence.

Having said that, you might want to try this dirty approach:

import re

text = 'Monika will go shopping. Mike likes Pizza. Monika hates me.'

names = ['Ruth', 'Mike', 'Monika']
rsplit = re.compile("|".join(sorted(names))).split

output = []
sentences = text.split(".")
for name in names:
    for sentence in sentences:
        if name in sentence:
            output.append([name, f"{rsplit(sentence)[-1]}."])

print(output)

This outputs:

[['Mike', ' likes Pizza.'], ['Monika', ' will go shopping.'], ['Monika', ' hates me.']]

CodePudding user response:

If you are looking for a way to use regular expressions, then:

import re

def do_split(text, names):
    joined_names = '|'.join(re.escape(name) for name in names)

    regex1 = re.compile('(?=' + joined_names + ')')
    strings = filter(lambda s: s != '', regex1.split(text))

    regex2 = re.compile('(' + joined_names + ')')
    return [list(filter(lambda s: s != '', regex2.split(s))) for s in strings]

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']
print(do_split(text, names))

Prints:

[['Monika', ' goes shopping. Then she rides bike. '], ['Mike', ' likes Pizza. '], ['Monika', ' hates me.']]

Explanation

First we dynamically create a regex, regex1, from the passed names argument:

(?=Mike|Monika)

When you split the input on this regex, you can end up with empty strings in the result, for example when a name appears at the very beginning of the input, so we filter those out and get:

['Monika goes shopping. Then she rides bike. ', 'Mike likes Pizza. ', 'Monika hates me.']
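
To see where the empty string comes from, here is a small check of the first step on its own, before filtering. It assumes Python 3.7 or newer, where re.split() also splits on patterns that match an empty string, and it hard-codes the pattern for the two names in the example:

```python
import re

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'

# A zero-width lookahead matches the position just before a name,
# so each name stays attached to the text that follows it.
lookahead = re.compile('(?=Mike|Monika)')

pieces = lookahead.split(text)
print(pieces)
# ['', 'Monika goes shopping. Then she rides bike. ', 'Mike likes Pizza. ', 'Monika hates me.']
```

The leading '' appears because the text starts with a name, so the first split happens at position 0.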

Then we split each of these strings on:

(Mike|Monika)

And again we filter out any possible empty strings to get our final result.

The key to all of this is that when the regex we split on contains a capture group, the text matched by that group is also returned as part of the resulting list.
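
Since the capture group already returns the names, the same idea can probably be done in a single pass. The following is only a sketch of that variant, not the code from the answer: split_by_names is a hypothetical helper name, and it assumes names is non-empty and that any text before the first name can be dropped.

```python
import re

def split_by_names(text, names):
    # Hypothetical helper: one capturing alternation built from the escaped names.
    pattern = re.compile('(' + '|'.join(re.escape(n) for n in names) + ')')
    parts = pattern.split(text)
    # parts[0] is whatever precedes the first name; after that the list
    # alternates name, following text, name, following text, ...
    return [[parts[i], parts[i + 1]] for i in range(1, len(parts), 2)]

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']

pairs = split_by_names(text, names)
print(pairs)
# [['Monika', ' goes shopping. Then she rides bike. '], ['Mike', ' likes Pizza. '], ['Monika', ' hates me.']]
```

One difference from do_split: if text ends with a name, this version returns that name paired with an empty string instead of a one-element list.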
