Identify the names of the interlocutors (most frequent words that follow a certain pattern) to separ-CodePudding

import re

#To read input data file
with open("dm_chat_data.txt") as input_data_file:
    print(input_data_file.read())

#To write corrections in a new text file
with open('dm_chat_data_fixed.txt', 'w') as file:
    file.write('\n')

This is the text file extracted by webscraping, but the lines of the dialogs of each of its chat partners are not separated, so the program must identify when each user starts the dialog. File dm_chat_data.txt

Desempleada_19: HolaaLucyGirl: hola como estas?Desempleada_19: Masomenos y vos LucyGirl?Desempleada_19: Q edad tenes LucyGirl: tengo 19LucyGirl: masomenos? que paso? (si se puede preguntar claro)Desempleada_19: Yo tmb 19 me llamo PriscilaDesempleada_19: Desempleada_19: Q hacías LucyGirl: Entre al chat para ver que onda, no frecuento mucho
Charge file [100%] (ddddfdfdfd)
LucyGirl: Desempleada_19: Gracias!

AndrewSC: HolaAndrewSC: Si quieres podemos hablar LyraStar: claro LyraStar: que cuentas amigaAndrewSC: Todo bien y tú?
Charge file [100%] (ddddfdfdfd)
LyraStar: LyraStar: que tal ese auto?AndrewSC: Creo que...Diria que... ya son las 19 : 00 hs AndrewSC: Muy bien la verdad

Bsco_Pra_Cap_: HolaBsco_Pra_Cap_: como vaBsco_Pra_Cap_: Jorge, 47, de Floresta, me presento a la entrevista, vos?Bsco_Pra_Cap_: es aqui, cierto?LucyFlame: holaaLucyFlame: estas?LucyFlame: soy una programadora de la ciudad de HudsonBsco_Pra_Cap_: de Hudson centro? o hudson alejado...?Bsco_Pra_Cap_: contame, Lu, que buscas en esta organizacion?

And this is the file that you must create separating the dialogues of each interlocutor in each of the chats. The file edited_dm_chat_data.txt need to be like this...

Desempleada_19: Holaa
LucyGirl: hola como estas?
Desempleada_19: Masomenos y vos LucyGirl?
Desempleada_19: Q edad tenes 
LucyGirl: tengo 19
LucyGirl: masomenos? que paso? (si se puede preguntar claro)
Desempleada_19: Yo tmb 19 me llamo Priscila
Desempleada_19: 
Desempleada_19: Q hacías 
LucyGirl: Entre al chat para ver que onda, no frecuento mucho
Charge file [100%] (ddddfdfdfd)
LucyGirl: 
Desempleada_19: Gracias!


AndrewSC: Hola
AndrewSC: Si quieres podemos hablar 
LyraStar: claro 
LyraStar: que cuentas amiga
AndrewSC: Todo bien y tú?
Charge file [100%] (ddddfdfdfd)
LyraStar: LyraStar: que tal ese auto?
AndrewSC: Creo que...Diria que... ya son las 19 : 00 hs 
AndrewSC: Muy bien la verdad

Bsco_Pra_Cap_: Hola
Bsco_Pra_Cap_: como va
Bsco_Pra_Cap_: Jorge, 47, de Floresta, me presento a la entrevista, vos?Bsco_Pra_Cap_: es aqui, cierto?
LucyFlame: holaa
LucyFlame: estas?
LucyFlame: soy una programadora de la ciudad de Hudson
Bsco_Pra_Cap_: de Hudson centro? o hudson alejado...?
Bsco_Pra_Cap_: contame, Lu, que buscas en esta organizacion?

I have tried to use regex, where each interlocutor is represented by a "Word" that begins in uppercase immediately followed by ": "

But there are some lines that give some problems to this logic, for example "Bsco_Pra_Cap_: HolaBsco_Pra_Cap_: como va", where the substring "Hola" is a simply word that is not a name and is attached to the name with capital letters, then it would be confused and consider "HolaBsco_Pra_Cap_: " as a name, but it's incorrect because the correct users name is "Bsco_Pra_Cap_: "

This problem arises because we don't know what the nicknames of the interlocutor users will be, and... the only thing we know is the structure where they start with a capital letter and end in : and then an empty space, but one thing I've noticed is that in all chats the names of the conversation partners are the most repeated words, so I think I could use a regular expression pattern as a word frequency counter by setting a search criteria like this "[INITIAL CAPITAL LETTER] hjasahshjas: " , and put as line separators those substrings with these characteristics as long as they are the ones that are repeated the most throughout the file

input_data_file = open("dm_chat_data.txt", "r ")
#maybe you can use something like this to count the occurrences and thus identify the nicknames
input_data_file.count(r"[A-Z][^A-Z]*:\s")

CodePudding user response：

I think it is quite hard. but you can build a rules as shown in below code:

import nltk
from collections import Counter

text = '''Desempleada_19: HolaaLucyGirl: hola como estas?Desempleada_19: 
Masomenos y vos LucyGirl?Desempleada_19: Q edad tenes LucyGirl: tengo 
19LucyGirl: masomenos? que paso? (si se puede preguntar claro)Desempleada_19: Yo 
tmb 19 me llamo PriscilaDesempleada_19: Desempleada_19: Q hacías LucyGirl: Entre 
al chat para ver que onda, no frecuento mucho
Charge file [100%] (ddddfdfdfd)
LucyGirl: Desempleada_19: Gracias!

AndrewSC: HolaAndrewSC: Si quieres podemos hablar LyraStar: claro LyraStar: que 
cuentas amigaAndrewSC: Todo bien y tú?
Charge file [100%] (ddddfdfdfd)
LyraStar: LyraStar: que tal ese auto?AndrewSC: Creo que...Diria que... ya son 
las 19 : 00 hs AndrewSC: Muy bien la verdad

Bsco_Pra_Cap_: HolaBsco_Pra_Cap_: como vaBsco_Pra_Cap_: Jorge, 47, de Floresta, 
me presento a la entrevista, vos?Bsco_Pra_Cap_: es aqui, cierto?LucyFlame: 
holaaLucyFlame: estas?LucyFlame: soy una programadora de la ciudad de 
HudsonBsco_Pra_Cap_: de Hudson centro? o hudson alejado...?Bsco_Pra_Cap_: 
contame, Lu, que buscas en esta organizacion?
'''
data = nltk.word_tokenize(text)

user_lst = []
for ind, val in enumerate(data):
    if val == ':':
        user_lst.append(data[ind - 1])

# printing duplicates assuming the users were speaking more than one time. if a 
user has one dialog box it fails.
users = [k for k, v in Counter(user_lst).items() if v > 1]


# function to replace a string:
def replacer(string, lst):
    for v in lst:
        string = string.replace(v, f' {v}')
    return string

# replace users in old text with single space in it.
refined_text = replacer(text, users)
refined_data = nltk.word_tokenize(refined_text)

correct_users = []
dialog = []
for ind, val in enumerate(refined_data):
    if val == ':':
        correct_users.append(refined_data[ind - 1])
    if val not in users:
        dialog.append(val)

correct_dialog = ' '.join(dialog).replace(':', '<:').split('<')
strip_dialog = [i.strip() for i in correct_dialog if i.strip()]

chat = []
for i in range(len(correct_users)):
    chat.append(f'{correct_users[i]}{strip_dialog[i]}')
print(chat)

>>>> ['Desempleada_19: Holaa', 'LucyGirl: hola como estas ?', 'Desempleada_19: Masomenos y vos ?', 'Desempleada_19: Q edad tenes', 'LucyGirl: tengo 19', 'LucyGirl: masomenos ? que paso ? ( si se puede preguntar claro )', 'Desempleada_19: Yo tmb 19 me llamo Priscila', 'Desempleada_19:', 'Desempleada_19: Q hacías', 'LucyGirl: Entre al chat para ver que onda , no frecuento mucho Charge file [ 100 % ] ( ddddfdfdfd )', 'LucyGirl:', 'Desempleada_19: Gracias !', 'AndrewSC: Hola', 'AndrewSC: Si quieres podemos hablar', 'LyraStar: claro', 'LyraStar: que cuentas amiga', 'AndrewSC: Todo bien y tú ? Charge file [ 100 % ] ( ddddfdfdfd )', 'LyraStar:', 'LyraStar: que tal ese auto ?', 'AndrewSC: Creo que ... Diria que ... ya son las 19', '19: 00 hs', 'AndrewSC: Muy bien la verdad', 'Bsco_Pra_Cap_: Hola', 'Bsco_Pra_Cap_: como va', 'Bsco_Pra_Cap_: Jorge , 47 , de Floresta , me presento a la entrevista , vos ?', 'Bsco_Pra_Cap_: es aqui , cierto ?', 'LucyFlame: holaa', 'LucyFlame: estas ?', 'LucyFlame: soy una programadora de la ciudad de Hudson', 'Bsco_Pra_Cap_: de Hudson centro ? o hudson alejado ... ?', 'Bsco_Pra_Cap_: contame , Lu , que buscas en esta organizacion ?']