Can't find regex pattern to extract authors and messages of WhatsApp chat history

I am currently working on a WhatsApp chat analyser and have been trying to work out a pattern that matches the authors and messages in the chat history, but without success.

I have a sample of the chat which looks like this:

08.03.22, 20:55 - Laura: Ja klingt gut :)
08.03.22, 21:00 - Anil: Wunderbar :)

What pattern would extract Laura and Anil into one list, and Ja klingt gut :) and Wunderbar :) into another? I have already found a pattern for the dates and times.

Thanks in advance.

CodePudding user response：

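The following regex should solve your problem:

\s(?P<fname>[a-zA-Z]+)\:\s(?P<sname>[a-zA-Z\s]+\s:\))

Reading it from left to right: \s matches the space before the name, the fname group captures one or more letters, \:\s matches the colon and the space after it, and the sname group captures letters and whitespace followed by a closing :) smiley.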

To see how to use it in Python, take a look at the code below and its output:

import re

# fname captures the author; sname captures a message that ends in " :)"
pattern = r"\s(?P<fname>[a-zA-Z]+)\:\s(?P<sname>[a-zA-Z\s]+\s:\))"
string = """
08.03.22, 20:55 - Laura: Ja klingt gut :)
08.03.22, 21:00 - Anil: Wunderbar :)
""".strip()
print(re.findall(pattern, string))

Output

[('Laura', 'Ja klingt gut :)'), ('Anil', 'Wunderbar :)')]

The first value of each tuple in the output list is the author's first name (the fname group in the regex), and the second is the message (the sname group).
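Note that the pattern above only matches messages that end in a :) smiley. As a minimal sketch, assuming every message fits on one line and follows the "date, time - author: message" layout of the sample, a more general pattern could anchor on the timestamp and the first ": " instead. The group names author and message and the helper split_chat are illustrative, not part of the original answer.

```python
import re

# Assumed layout (taken from the sample): "DD.MM.YY, HH:MM - Author: Message"
LINE_RE = re.compile(
    r"^\d{2}\.\d{2}\.\d{2}, \d{2}:\d{2} - (?P<author>[^:\n]+): (?P<message>.*)$",
    re.MULTILINE,
)

def split_chat(text):
    """Return (authors, messages) as two parallel lists."""
    matches = list(LINE_RE.finditer(text))
    authors = [m.group("author") for m in matches]
    messages = [m.group("message") for m in matches]
    return authors, messages

chat = """08.03.22, 20:55 - Laura: Ja klingt gut :)
08.03.22, 21:00 - Anil: Wunderbar :)"""

authors, messages = split_chat(chat)
print(authors)   # ['Laura', 'Anil']
print(messages)  # ['Ja klingt gut :)', 'Wunderbar :)']
```

Lines that do not match the assumed layout are simply skipped, so the two lists always stay the same length.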

CodePudding user response：

If your messages are in a list, you can try the following:

import re

v = ["08.03.22, 20:55 - Laura: Ja klingt gut :)",
     "08.03.22, 21:00 - Anil: Wunderbar :)"]

# Group 1 is the name, group 2 the message (letters, spaces, ':' and ')' only).
pattern = re.compile(r'- ([A-Za-z]*): ([A-Za-z :)]*)')
matches = [pattern.search(x) for x in v]
names = [m.group(1) for m in matches]
messages = [m.group(2) for m in matches]

CodePudding user response：

For your examples you can use (\d\d\.\d\d\.\d\d, \d\d:\d\d) - ([^:]+): (.*) which captures the timestamp, the author and the message in groups 1, 2 and 3.
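A pattern with three capture groups (timestamp, author, message) can be applied line by line with re.match. The sketch below is one hedged way to write it: the quantifiers in the pattern string, the names LINE_RE and parse_line, and the day.month.year reading of the date are assumptions based on the sample lines.

```python
import re
from datetime import datetime

# Three groups: timestamp, author, message. Dots are escaped so they match literally.
LINE_RE = re.compile(r"(\d\d\.\d\d\.\d\d, \d\d:\d\d) - ([^:]+): (.*)")

def parse_line(line):
    """Return (timestamp, author, message), or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    stamp, author, message = m.groups()
    # Assumption: the date is day.month.year, as the German sample suggests.
    return datetime.strptime(stamp, "%d.%m.%y, %H:%M"), author, message

lines = [
    "08.03.22, 20:55 - Laura: Ja klingt gut :)",
    "08.03.22, 21:00 - Anil: Wunderbar :)",
]
parsed = [p for p in map(parse_line, lines) if p is not None]
authors = [author for _, author, _ in parsed]
messages = [message for _, _, message in parsed]
print(authors)
print(messages)
```

Converting the timestamp to a datetime is optional, but it tends to be handy in an analyser, for example to count messages per day.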

CodePudding user response：

Have you considered using split()?

Assuming usernames can't contain a ":" character:

my_string = "Laura: Ja klingt gut :)"
a = my_string.split(':', 1)   # split at the first colon only
username = a[0]
message = a[1][1:]            # drop the space that follows the colon
print(username)
print(message)

Output:

Laura
Ja klingt gut :)
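The second argument to split() matters as soon as the message itself contains a colon, for example a time. A small sketch with a made-up message and a hypothetical helper split_author shows the difference; it splits on ": " so that no slicing is needed afterwards:

```python
def split_author(entry):
    """Split 'Name: text' at the first ': ' only; later colons stay in the text."""
    username, message = entry.split(": ", 1)
    return username, message

entry = "Laura: Treffen um 20:30?"   # made-up message containing a second colon

print(entry.split(":"))       # ['Laura', ' Treffen um 20', '30?']
print(split_author(entry))    # ('Laura', 'Treffen um 20:30?')
```

If an entry contains no ": " at all, the unpacking raises a ValueError, so such lines would need to be filtered out first.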

CodePudding user response：

Of course you can use a regex, but keep in mind that for fixed delimiters like these, regular expressions are generally slower than plain split or partition calls and add some overhead of their own. If your chat history isn't very big, any of the solutions given earlier will do. Otherwise, consider a faster and much simpler solution.

As we can see, each line has the following structure: date and time, followed by a dash, then the name, a colon and the message. It is enough to remove the timestamp and then partition the remainder at the first occurrence of a colon.

To extract the name and text from each line, you simply need to do this:

name, _, text = line.split(' - ', 1)[1].partition(': ')

As a whole, it may look like this:

messages_list = [
    '08.03.22, 20:55 - Laura: Ja klingt gut :)',
    '08.03.22, 21:00 - Anil: Wunderbar :)'
]

# Drop everything up to the first ' - ', then split the rest at the first ': '.
output = [line.split(' - ', 1)[1].partition(': ') for line in messages_list]
names = [item[0] for item in output]
messages = [item[-1] for item in output]
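Whether the speed difference matters for a given chat export is easy to measure. The following is only a rough sketch using timeit on synthetic data (the two sample lines repeated); the function names with_regex and with_partition and the pattern are illustrative, and the timings will vary by machine and Python version.

```python
import re
import timeit

lines = [
    "08.03.22, 20:55 - Laura: Ja klingt gut :)",
    "08.03.22, 21:00 - Anil: Wunderbar :)",
] * 1_000  # synthetic data: the two sample lines repeated

LINE_RE = re.compile(r" - ([^:]+): (.*)")

def with_regex(rows):
    """Extract (name, text) pairs with a precompiled regex."""
    return [LINE_RE.search(row).groups() for row in rows]

def with_partition(rows):
    """Extract (name, text) pairs with split and partition only."""
    result = []
    for row in rows:
        name, _, text = row.split(" - ", 1)[1].partition(": ")
        result.append((name, text))
    return result

# Both approaches should agree before their speed is compared.
assert with_regex(lines) == with_partition(lines)

for func in (with_regex, with_partition):
    seconds = timeit.timeit(lambda: func(lines), number=10)
    print(f"{func.__name__}: {seconds:.3f} s")
```

No timings are quoted here on purpose; it is worth running the comparison on your own export before deciding.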