I have a string text and a list names. I want to split text every time an element of names occurs.
text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']
desired output:
output = [['Monika', ' goes shopping. Then she rides bike.'], ['Mike', ' likes Pizza.'], ['Monika', ' hates me.']]
FAQ
- The order of the separators within names is independent of their occurrence in text.
- Separators within names are unique but can occur multiple times throughout text. Therefore the output will have more lists than names has strings.
- text will never have the same unique names element occurring twice consecutively.
- Ultimately I want the output to be a list of lists where each split text slice corresponds to the separator that it was split by. The order of the lists doesn't matter.
re.split() won't let me use a list as a separator argument. Can I re.compile() my separator list?
help:
I think somebody has already had a similar problem here: https://stackoverflow.com/a/4697047/14648054
def split(txt, seps):
    default_sep = seps[0]
    for sep in seps[1:]:  # skip seps[0] as the default separator
        txt = txt.replace(sep, default_sep)
    return [i.strip() for i in txt.split(default_sep)]
and here: https://stackoverflow.com/a/2911664/14648054
def my_split(s, seps):
    res = [s]
    for sep in seps:
        s, res = res, []
        for seq in s:
            # extend the result with the pieces of every intermediate string
            res += seq.split(sep)
    return res

print(my_split('1111  2222 3333;4444,5555;6666', [' ', ';', ',']))
['1111', '', '2222', '3333', '4444', '5555', '6666']
CodePudding user response:
Your example doesn't fully match your desired output. Also, it's not clear whether the example input will always have this structure, e.g. with a period at the end of each sentence.
Having said that, you might want to try this dirty approach:
import re
text = 'Monika will go shopping. Mike likes Pizza. Monika hates me.'
names = ['Ruth', 'Mike', 'Monika']
rsplit = re.compile("|".join(sorted(names))).split
output = []
sentences = text.split(".")
for name in names:
    for sentence in sentences:
        if name in sentence:
            output.append([name, f"{rsplit(sentence)[-1]}."])
print(output)
This outputs:
[['Mike', ' likes Pizza.'], ['Monika', ' will go shopping.'], ['Monika', ' hates me.']]
CodePudding user response:
If you are looking for a way to use regular expressions, then:
import re
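def do_split(text, names):
    # escape each name so regex metacharacters are matched literally
    joined_names = '|'.join(re.escape(name) for name in names)
    # zero-width lookahead: split just before each name without consuming it
    regex1 = re.compile('(?=' + joined_names + ')')
    strings = filter(lambda s: s != '', regex1.split(text))
    # capture group: the matched name is kept in the split result
    regex2 = re.compile('(' + joined_names + ')')
    return [list(filter(lambda s: s != '', regex2.split(s))) for s in strings]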
text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']
print(do_split(text, names))
Prints:
[['Monika', ' goes shopping. Then she rides bike. '], ['Mike', ' likes Pizza. '], ['Monika', ' hates me.']]
Explanation
First we dynamically create a regex, regex1, from the passed names argument to be:
(?=Mike|Monika)
When you split the input on this, you could end up with empty strings in the result, because any of the passed names may appear at the beginning or end of the input. We filter those out and get:
['Monika goes shopping. Then she rides bike. ', 'Mike likes Pizza. ', 'Monika hates me.']
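To see the empty string that the filter removes, here is a minimal check of the lookahead split on its own. It assumes Python 3.7 or later, where re.split() can split on a zero-width match; the names lookahead, raw_parts and parts are only for illustration and are not part of the answer's code.

```python
import re

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'

# Zero-width lookahead: matches the position just before a name, consuming nothing.
lookahead = re.compile('(?=Mike|Monika)')

# The text starts with a name, so the first piece is an empty string.
raw_parts = lookahead.split(text)

# Drop empty pieces, as the filter() call in do_split does.
parts = [s for s in raw_parts if s != '']

print(raw_parts)
print(parts)
```

Because the lookahead consumes no characters, joining parts gives back the original text unchanged.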
Then we split each of these strings on:
(Mike|Monika)
And again we filter out any possible empty strings to get our final result.
The key to all of this is that when our regex on which we split contains a capture group, the text of that capture group is also returned as part of the resulting list.
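The capture-group behaviour alone may be enough to do the whole job in one pass. The following is a rough sketch of that idea, not the code from the answer: split_by_names is a hypothetical helper, and it assumes that any text before the first name can be discarded and that no name is a prefix of another name.

```python
import re

def split_by_names(text, names):
    # Hypothetical helper: one split on a capturing alternation of the names.
    pattern = re.compile('(' + '|'.join(re.escape(n) for n in names) + ')')
    # With one capture group, the result alternates:
    # [text_before_first_name, name, text, name, text, ...]
    parts = pattern.split(text)
    # Pair every name (odd indices) with the text that follows it (even indices from 2).
    return [[name, chunk] for name, chunk in zip(parts[1::2], parts[2::2])]

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']
print(split_by_names(text, names))
```

If one name could be a prefix of another, sorting the names by length in descending order before joining them would probably be needed, since the alternation tries its branches from left to right.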