Remove 2 or more consecutive non-caps words from strings stored within a list of strings using regex

Time:01-18

import re

list_with_penson_names_in_this_input = ["María Sol", "María del Carmen Perez Agüiño", "Melina Saez Sossa", "el de juego es Alex" , "Harry ddeh jsdelasd Maltus ", "Ben White ddesddsh jsdelasd Regina yáshas asdelas Javier Ruben Rojas", "Robert", 'Melina presento el nuevo presupuesto', "presento el nuevo presupuesto, María del Carmén "]

aux_list = []

for i in list_with_penson_names_in_this_input:
    list_with_penson_names_in_this_input.remove(i)
    aux_list = re.sub(, , i)
    list_with_penson_names_in_this_input = list_with_penson_names_in_this_input + aux_list
    aux_list = []
    

print(list_with_penson_names_in_this_input) #print fixed name list

If a string contains 2 or more consecutive words ((?:\w\s*)+) that do not start with a capital letter, and that are not connectors of the type [del|de el|de], those words should be removed. The name detected before them (starting with a capital letter) is kept, and the name detected after them, if there is one, should be placed as a separate element in the list. For example:

["Harry ddeh jsdelasd Maltus "] --> ["Harry", "Maltus"]

["Ben White ddesddsh jsdelasd Regina yáshas asdelas Javier Ruben Rojas"] --> ["Ben White", "Regina", "Javier Ruben Rojas"]

And if the string contains only one name, the 2 or more consecutive words that do not start with a capital letter, and that are not connectors of the type [del|de el|de], should simply be removed:

["Melina Martinez presento el nuevo presupuesto"] --> ["Melina Martinez"]

["Melina presento el nuevo presupuesto"] --> ["Melina"]

["presento el nuevo presupuesto, María del Carmén "] --> ["María del Carmén"]

After fixing the elements that do not meet these specifications, the list should contain:

["María Sol", "María del Carmen Perez Agüiño", "Melina Saez Sossa", "Alex" , "Harry", "Maltus", "Ben White", "Regina", "Javier Ruben Rojas", "Robert", 'Melina', "María del Carmén"]

For the words in between that do not start with a capital letter I tried something like ((?:\w\s*)+), but it does not restrict whether a word starts with a capital letter or not.
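A quick check of why that attempt over-matches, as a sketch that assumes the attempted pattern was ((?:\w\s*)+): \w accepts uppercase letters as well as lowercase ones, so the group swallows the names too. Anchoring each word on a lowercase first letter, shown here with an illustrative \b[a-z]\w* pattern that is not from the original attempt, is one way to narrow it down.

```python
import re

sample = "Harry ddeh jsdelasd Maltus "

# Assumed form of the attempted pattern: \w matches ANY word character,
# so capitalised names are consumed together with the filler words.
attempt = re.search(r"((?:\w\s*)+)", sample)
print(repr(attempt.group(1)))

# Illustrative alternative: only words whose first letter is lowercase a-z.
lowercase_words = re.findall(r"\b[a-z]\w*", sample)
print(lowercase_words)
```

The first print shows the whole string being captured, while the second lists only the two filler words.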

CodePudding user response:

Match runs of two or more consecutive words that start with a lowercase letter, then use re.split to split each string on those runs.

import re

name_lst = ["María Sol", "María del Carmen Perez Agüiño", "Melina Saez Sossa",
            "el de juego es Alex", "Harry ddeh jsdelasd Maltus ",
            "Ben White ddesddsh jsdelasd Regina yáshas asdelas Javier Ruben Rojas",
            "Robert", 'Melina presento el nuevo presupuesto',
            "presento el nuevo presupuesto, María del Carmén "]

cleaned_names = []
for i in name_lst:
    data = re.split(r"(?:\b[a-z]\S+\s){2,}(?:[a-z]+$)?", i)
    cleaned_names.extend(data)

output = [i.strip() for i in cleaned_names if i]
print(output)

>>> ['María Sol', 'María del Carmen Perez Agüiño', 'Melina Saez Sossa', 'Alex', 'Harry', 'Maltus', 'Ben White', 'Regina', 'Javier Ruben Rojas', 'Robert', 'Melina', 'María del Carmén']  
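The same idea can be wrapped in a small helper with the pattern spelled out piece by piece. This is only a sketch: it assumes the pattern reads (?:\b[a-z]\S+\s){2,}(?:[a-z]+$)?, and the names LOWERCASE_RUN and split_names are made up for illustration.

```python
import re

# Verbose form of the (assumed) pattern from the answer above.
LOWERCASE_RUN = re.compile(
    r"""
    (?:\b[a-z]\S+\s){2,}   # two or more words starting with a-z, each followed by whitespace
    (?:[a-z]+$)?           # optionally one last lowercase word that ends the string
    """,
    re.VERBOSE,
)

def split_names(text):
    """Split text on runs of lowercase words and return the non-empty pieces."""
    pieces = LOWERCASE_RUN.split(text)
    return [p.strip() for p in pieces if p.strip()]

print(split_names("Harry ddeh jsdelasd Maltus "))
print(split_names("María del Carmen Perez Agüiño"))
print(split_names("Melina presento el nuevo presupuesto"))
```

A single connector such as "del" is left alone because the run has to contain at least two lowercase words. Note that [a-z] is ASCII-only and \S+ needs at least one more character, so a word that starts with an accented lowercase letter, or a one-letter word such as "y", does not count towards a run.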