Remove from a list of strings, those strings that have only empty spaces or that are made up of less

Time:01-19

import re

sentences_list = ['Hay 5 objetos rojos sobre la mesada de ahí.', 'Debajo de la mesada hay 4 objetos', '', '     ', "\taa!", '\t\n \n', '\n ', 'ai\n ', 'Salto rapidamente!!!', 'y la vio volar', '!', '   aa', 'aa', 'día']

# This attempt only removes strings that are exactly one space. Several other
# cases need to be eliminated, and I think that calls for a regex.
sentences_list = [i for i in sentences_list if i != ' ']

print(repr(sentences_list)) #print the already filtered list to verify

I got these strings using a sentence separator. The problem is that some of them aren't really sentences, or aren't linguistically significant units.

  • Strings that have fewer than 3 alphanumeric characters (that is, 2 or fewer) must be removed from the list.

  • Strings that are empty ("") or blank (" "), or that consist of symbols with fewer than 3 alphanumeric characters, such as "...!", ";", ".\n", "\taa!", must be removed from the list.

  • Strings that contain only escape characters, possibly with symbols or with fewer than 3 alphanumeric characters, for example "\t\n ab .", "\n .", "\n", must be removed from the list.

This is how the list should look after filtering out the elements that do not meet these conditions:

['Hay 5 objetos rojos sobre la mesada de ahí.', 'Debajo de la mesada hay 4 objetos', 'Salto rapidamente!!!', 'y la vio volar', 'día']

CodePudding user response:

You can count the alphanumeric characters in a string by calling .isalnum() on each character and summing the results (each True counts as 1). A list comprehension can then keep only the strings that have at least 3 of them.

sentences_list = [s for s in sentences_list if sum(c.isalnum() for c in s) >= 3]
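
For reference, here is a minimal, self-contained sketch of that approach run against the list from the question. The helper name count_alnum and the constant MIN_ALNUM are made up for illustration and are not part of the answer above.

```python
# Sample data copied from the question
sentences_list = ['Hay 5 objetos rojos sobre la mesada de ahí.', 'Debajo de la mesada hay 4 objetos', '', '     ', "\taa!", '\t\n \n', '\n ', 'ai\n ', 'Salto rapidamente!!!', 'y la vio volar', '!', '   aa', 'aa', 'día']

MIN_ALNUM = 3  # hypothetical name: minimum number of alphanumeric characters


def count_alnum(s):
    """Count the characters of s for which str.isalnum() is true."""
    return sum(c.isalnum() for c in s)


# Keep only the strings with enough alphanumeric characters
filtered = [s for s in sentences_list if count_alnum(s) >= MIN_ALNUM]
print(filtered)
```

Since str.isalnum() is Unicode-aware, accented letters such as the í in 'día' are counted, which is why that string is kept.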

CodePudding user response:

You could use this if clause in your list comprehension, where i is each string:

if len(re.sub(r"\W", "", i)) >= 3
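
Put into a complete comprehension, that clause might look like the sketch below. One thing to be aware of: \W does not match underscores, because _ counts as a word character, so a string like "__a" would pass the check. The [\W_] pattern is an assumed variant, not part of the answer above, that strips underscores as well.

```python
import re

# Sample data copied from the question
sentences_list = ['Hay 5 objetos rojos sobre la mesada de ahí.', 'Debajo de la mesada hay 4 objetos', '', '     ', "\taa!", '\t\n \n', '\n ', 'ai\n ', 'Salto rapidamente!!!', 'y la vio volar', '!', '   aa', 'aa', 'día']

# Delete every non-word character, then require at least 3 remaining characters
filtered = [i for i in sentences_list if len(re.sub(r"\W", "", i)) >= 3]
print(filtered)

# Underscores survive \W, so they are counted unless removed explicitly
with_underscores = len(re.sub(r"\W", "", "__a"))        # underscores kept
without_underscores = len(re.sub(r"[\W_]", "", "__a"))  # underscores removed
print(with_underscores, without_underscores)
```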

CodePudding user response:

Here is a simple solution to filter out what you need:

from string import punctuation

for sentence in sentences_list:
    # Trim surrounding whitespace first, then surrounding punctuation
    trimmed_sentence = sentence.strip().strip(punctuation)
    if len(trimmed_sentence) > 2:
        print(sentence)

It goes through every sentence in the list, trims leading and trailing whitespace (spaces, tabs, newlines), then trims leading and trailing punctuation marks, and finally checks that what remains is longer than 2 characters.

See also the Python documentation for str.strip().
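
If you want a filtered list instead of printed output, the same idea can be written as a comprehension, as sketched below. Keep in mind that strip() only trims the ends of a string, so whitespace and punctuation in the middle still count towards the length: "\t\n ab ." from the question is trimmed to "ab " (3 characters) and is kept, although it has only 2 alphanumeric characters. The helper name trimmed_length is made up for this example.

```python
from string import punctuation

# Sample data copied from the question
sentences_list = ['Hay 5 objetos rojos sobre la mesada de ahí.', 'Debajo de la mesada hay 4 objetos', '', '     ', "\taa!", '\t\n \n', '\n ', 'ai\n ', 'Salto rapidamente!!!', 'y la vio volar', '!', '   aa', 'aa', 'día']


def trimmed_length(sentence):
    """Length after trimming whitespace, then punctuation, from both ends."""
    return len(sentence.strip().strip(punctuation))


# Keep the original string when its trimmed form has more than 2 characters
filtered = [s for s in sentences_list if trimmed_length(s) > 2]
print(filtered)

# Edge case from the question: only the ends are trimmed, so the inner
# space still counts and this string passes the length check
edge_case = "\t\n ab ."
print(trimmed_length(edge_case))
```

For the sample list this gives the same result as counting alphanumeric characters, but the two approaches can disagree on strings with punctuation or whitespace in the middle.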
