Use regex to search for a phrase where the last word has a maximum number of characters

I have the following list of phrases:

list = ['cyber attacks, 28','cyber attacks. A', 'cyber attacks, intrusions', 'cyber attacks; 3','cyber attack. Our', 'cyber intrusions, data']

I want to remove every phrase whose third word has more than three characters, so the final list would be:

new_list = ['cyber attacks, 28','cyber attacks. A', 'cyber attacks; 3','cyber attack. Our']

This is what I have so far, but it also keeps the phrases whose last word has more than three characters:

import re

new_list = []

for phrase in list:
    max_three_char = re.match(r'cyber\s\w{1,}(\.|,|;|\)|\/|:|"|])\s\w{,3}', phrase)
    if max_three_char:
        new_list.append(phrase)

CodePudding user response:

You could split each phrase into words and filter with a list comprehension:

import re

lst = ['cyber attacks, 28','cyber attacks. A', 'cyber attacks, intrusions', 'cyber attacks; 3','cyber attack. Our', 'cyber intrusions, data']
pattern = re.compile(r'\W+')  # split on runs of non-word characters (spaces and punctuation)

new_lst = [item
           for item in lst
           for splitted in [pattern.split(item)]
           if not (len(splitted) > 2 and len(splitted[2]) > 3)]

print(new_lst)

Which would yield

['cyber attacks, 28', 'cyber attacks. A', 'cyber attacks; 3', 'cyber attack. Our']

Also, don't name your variables after built-ins such as list; doing so shadows the built-in for the rest of the scope.
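
If you would rather keep the regex from the question, the likely culprit is that re.match only anchors at the start of the string, so \w{,3} is satisfied by the first three characters of "intrusions" and the match succeeds. A minimal sketch, assuming the third word is always the last one in the phrase: require the pattern to cover the whole string with fullmatch (adding $ at the end of the pattern would do the same job). Folding the punctuation alternatives into a character class and using {1,3} instead of {,3} are my simplifications, and the names phrases and short_last are only illustrative.

```python
import re

phrases = ['cyber attacks, 28', 'cyber attacks. A', 'cyber attacks, intrusions',
           'cyber attacks; 3', 'cyber attack. Our', 'cyber intrusions, data']

# Same shape as the pattern in the question: "cyber", a word, one
# punctuation mark, a space, then a last word of 1 to 3 characters.
pattern = re.compile(r'cyber\s\w+[.,;)/:"\]]\s\w{1,3}')

# fullmatch requires the pattern to consume the entire string, so a long
# last word can no longer match on just its first three characters.
short_last = [p for p in phrases if pattern.fullmatch(p)]
print(short_last)
# ['cyber attacks, 28', 'cyber attacks. A', 'cyber attacks; 3', 'cyber attack. Our']
```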

CodePudding user response:

No need for regex; you can use str.split:

words = my_phrase.split()  # my_phrase is one item from your list
if len(words) > 2 and len(words[2]) <= 3:
    print(my_phrase)  # process my_phrase here

This works since there are spaces between the words.

CodePudding user response:

I would do:

import re 

li = ['cyber attacks, 28','cyber attacks. A', 'cyber attacks, intrusions', 'cyber attacks; 3','cyber attack. Our', 'cyber intrusions, data']

>>> [s for s in li if re.search(r'(?<=\W)\w{1,3}$', s)]
['cyber attacks, 28', 'cyber attacks. A', 'cyber attacks; 3', 'cyber attack. Our']

Or, if you can count on a space delimiter:

>>> [s for s in li if len(s.split()[-1])<=3]
# same output

CodePudding user response:

Since your separator is a space, you don't need regex; Python's built-in str.split() method is enough.

ls = ['cyber attacks, 28', 'cyber attacks. A', 'cyber attacks, intrusions', 'cyber attacks; 3', 'cyber attack. Our',
      'cyber intrusions, data']

def my_filter(i) -> bool:
    sub_str = i.split(' ', 2)
    if len(sub_str) < 3:
        return False
    return len(sub_str[2]) <= 3

print([i for i in ls if my_filter(i)])
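
One caveat, in case a phrase can ever contain more than three words (the sample data never does, so this is an assumption): with maxsplit=2 the third element holds the whole rest of the string, not just the third word. A small sketch of the difference, using a hypothetical helper third_word_is_short and a made-up sample phrase:

```python
def third_word_is_short(phrase: str, limit: int = 3) -> bool:
    # Hypothetical helper: split on any whitespace and inspect only the third word.
    words = phrase.split()
    return len(words) >= 3 and len(words[2]) <= limit

sample = 'cyber attacks, 28 were reported'  # made-up phrase with extra words

# With maxsplit=2 the third element is '28 were reported', so the check fails.
print(len(sample.split(' ', 2)[2]) <= 3)  # False
# Splitting fully isolates '28' as the third word.
print(third_word_is_short(sample))        # True
```

For the list in the question both versions give the same result, so this only matters if longer phrases show up.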