Split a string by regex and keep the separator as part of the items in Python

I want to split a WhatsApp chat backup text by date and keep the date as part of each message. I tried but couldn't get the exact result I want. If anyone can suggest a way to achieve this, that would be a big help. (I don't know much about regex.)

import re

chat = '''27/01/2019, 08:58 - Member 01 created group "Python Lovers ❤️"
27/01/2019, 08:58 - You were added
19/03/2019, 19:29 - Member 02: Hello guys,,,
19/03/2019, 19:29 - Member 03: Hi there..'''

regex = r"(\b\d+/\d+/\d+.*?(?=\b\d+/\d+/\d+|$)*)"
results = re.split(regex, chat)
print(results)

The above code does the job but keeps the separator as a separate item; what I want is for it to be part of its corresponding message (item):

Current Result

['27/01/2019', 
'08:58 - You were added',
'19/03/2019', 
'19:29 - Member 02: Hello guys,,', 
'19/03/2019', 
'19:29 - Member 03: Hi there..']

WHAT I WANT

['27/01/2019, 08:58 - You were added',
'19/03/2019, 19:29 - Member 02: Hello guys,,', 
'19/03/2019, 19:29 - Member 03: Hi there..']

CodePudding user response：

That happened because re.split returns the text matched by a capturing group as separate items in the resulting list.
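To illustrate how a capturing group changes what re.split returns, here is a minimal sketch. The sample string and variable names are made up for the demonstration and are not part of the question's data:

```python
import re

sample = "a,b,c"

# No capturing group: the separators are discarded.
without_group = re.split(r",", sample)

# Capturing group: each matched separator comes back as its own list item.
with_group = re.split(r"(,)", sample)

print(without_group)  # ['a', 'b', 'c']
print(with_group)     # ['a', ',', 'b', ',', 'c']
```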

Your regex makes sense only if a match can span several lines; otherwise, extracting every line that starts with a date-like pattern would be enough.
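For the simpler single-line case, a rough sketch could look like the following. It assumes every message fits on one line; the shortened chat sample and the name single_line_messages are only for illustration:

```python
import re

chat = (
    "27/01/2019, 08:58 - You were added\n"
    "19/03/2019, 19:29 - Member 02: Hello guys,,,\n"
    "19/03/2019, 19:29 - Member 03: Hi there.."
)

# re.M makes ^ match at the start of every line.
# Without re.S, "." stops at the newline, so each match is exactly one line.
single_line_messages = re.findall(r"^\d+/\d+/\d+.*", chat, re.M)

for message in single_line_messages:
    print(message)
```

If a message wraps onto a second line, this line-based approach drops the continuation line, so a pattern that can span lines is needed.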

That is why I'd suggest

regex = r"\b\d+/\d+/\d.*?(?=\s*\b\d+/\d+/\d+|$)"
results = re.findall(regex, chat, re.S)

See the Python demo:

import re

chat = '''27/01/2019, 08:58 - Member 01 created group "Python Lovers ❤️"
27/01/2019, 08:58 - You were added
19/03/2019, 19:29 - Member 02: Hello guys,,,
19/03/2019, 19:29 - Member 03: Hi there..'''

regex = r"\b\d+/\d+/\d.*?(?=\s*\b\d+/\d+/\d+|$)"
results = re.findall(regex, chat, re.S)
for r in results:
    print(r)

Output:

27/01/2019, 08:58 - Member 01 created group "Python Lovers ❤️"
27/01/2019, 08:58 - You were added
19/03/2019, 19:29 - Member 02: Hello guys,,,
19/03/2019, 19:29 - Member 03: Hi there..

Note the absence of the redundant capturing group and of the * after the positive lookahead, which made the lookahead optional. Whitespace at the end of each match is excluded by the \s* pattern inside the lookahead.

The re.S flag allows . to match any char including line break chars.
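To see why the flag matters, here is a small sketch with a hypothetical sample in which the first message wraps onto a second line. The sample text and variable names are assumptions made for the demonstration:

```python
import re

# Hypothetical sample: the first message continues on a second line.
chat = (
    "27/01/2019, 08:58 - You were\nadded\n"
    "19/03/2019, 19:29 - Member 02: Hello guys,,,"
)

regex = r"\b\d+/\d+/\d+.*?(?=\s*\b\d+/\d+/\d+|$)"

# With re.S, "." also matches "\n", so the wrapped message stays in one item.
with_dotall = re.findall(regex, chat, re.S)

# Without re.S, "." cannot cross the line break, so the wrapped message is lost.
without_dotall = re.findall(regex, chat)

print(with_dotall)
print(without_dotall)
```

With the flag, both messages are returned and the first one keeps its line break; without it, only the last message is found.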

CodePudding user response：

Would you please try a solution with the regex module from PyPI:

import regex as re

chat = '''27/01/2019, 08:58 - Member 01 created group "Python Lovers ❤️"
27/01/2019, 08:58 - You were added
19/03/2019, 19:29 - Member 02: Hello guys,,,
19/03/2019, 19:29 - Member 03: Hi there..'''

pat = r'(?V1)\n*(?=\d{2}/\d{2}/\d{4})'
results = re.split(pat, chat)
print(results[1:])

Output:

['27/01/2019, 08:58 - Member 01 created group "Python Lovers \xe2\x9d\xa4\xef\xb8\x8f"', '27/01/2019, 08:58 - You were\nadded', '19/03/2019, 19:29 - Member 02: Hello guys,,,', '19/03/2019, 19:29 - Member 03: Hi there..']

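The (?V1) flag selects the regex module's version 1 behaviour, in which zero-width matches are handled correctly.
The separator \n*(?=\d{2}/\d{2}/\d{4}) matches any newlines that precede a date. Because the date itself sits inside a lookahead, it is not consumed and stays in the resulting item.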
results[1:] removes the empty item at the beginning of the list.
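A similar lookahead split can be sketched with the standard library re module alone. This is a variation, not the answer's exact pattern: it assumes messages are separated by a single newline, and it requires that newline in the separator, so no empty leading item is produced. The name messages is illustrative:

```python
import re

chat = '''27/01/2019, 08:58 - Member 01 created group "Python Lovers ❤️"
27/01/2019, 08:58 - You were added
19/03/2019, 19:29 - Member 02: Hello guys,,,
19/03/2019, 19:29 - Member 03: Hi there..'''

# Split on a newline only when a dd/mm/yyyy date follows it.
# The lookahead is zero-width, so the date stays at the start of the next item.
messages = re.split(r"\n(?=\d{2}/\d{2}/\d{4})", chat)

for message in messages:
    print(message)
```

Because the split happens only before a date, a message that wraps onto extra lines would stay together in one item.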